WO2022166412A1

WO2022166412A1 - Self-supervised three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement

Info

Publication number: WO2022166412A1
Application number: PCT/CN2021/137980
Authority: WO
Inventors: 许鸿斌; 周志鹏; 乔宇; 康文雄; 吴秋霞
Original assignee: 中国科学院深圳先进技术研究院
Priority date: 2021-02-05
Filing date: 2021-12-14
Publication date: 2022-08-11
Also published as: CN112767468B; CN112767468A

Abstract

Disclosed are a self-supervised three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement. Said method comprises: acquiring input data, and acquiring a multi-view image pair according to the input data; acquiring a photometric consistency loss by performing depth estimation processing on the multi-view image pair; acquiring a semantic consistency loss by performing collaborative segmentation processing on the multi-view image pair; acquiring a data enhancement consistency loss by performing data enhancement processing on the multi-view image pair; constructing a loss function according to the photometric consistency loss, the semantic consistency loss and the data enhancement consistency loss; and constructing and training a neural network model according to the loss function, and acquiring a three-dimensional model corresponding to the input data on the basis of the neural network model. By introducing a semantic cue and embedding a data enhancement mechanism, the reliability of a self-supervised signal under noise disturbance is enhanced, the precision and performance of the self-supervised algorithm are improved, the cost is low, the generalization is high, and the application scene is wide.

Description

Self-supervised 3D reconstruction method and system based on collaborative segmentation and data augmentation

technical field

The present invention relates to the field of image processing, in particular, to a self-supervised three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement.

Background technique

The 3D reconstruction method based on multi-view stereo vision (MVS) aims to restore the 3D structure of the scene by taking natural images and camera positions from multiple pre-given perspectives. Although traditional 3D reconstruction methods can effectively reconstruct 3D models in general scenarios, due to the limitations of traditional measurement methods, in many cases, traditional 3D reconstruction algorithms can only reconstruct a relatively sparse point cloud, which loses a lot of details. In addition, it is also easily interfered by factors such as noise and lighting.

With the rapid development of deep learning, more and more researchers have begun to apply it in the field of 3D reconstruction. With the help of the powerful feature extraction capability of deep convolutional neural network (CNN), these learning-based methods project the feature maps extracted by CNN to the same reference perspective through homography mapping, and build on several The matching error volume (CV) between these views at depth. The matching error body predicts the depth map at the reference view. The depth maps from each viewpoint are fused together to reconstruct the 3D information of the entire scene. Such data-driven 3D reconstruction methods, such as MVSNet, R-MVSNet, and Point-MVSNet, have achieved better results than traditional 3D reconstruction methods.

However, these methods are highly dependent on the available large-scale 3D datasets, and it is difficult to achieve good results without sufficient labeled samples. In addition, for 3D reconstruction, it is difficult and expensive to obtain accurate ground-truth sample labels. As a result, a series of un/self-supervised 3D reconstruction methods have been derived, aiming to replace a large number of expensive ground-truth labels with artificially designed self-supervised signals to train deep 3D reconstruction networks.

The depth estimation problem in the 3D reconstruction pipeline of these self-supervised methods is transformed into the image reconstruction problem to design self-supervised signals. The depth map and multi-view image predicted by the network are projected to the same view through homography, and the pixel value is calculated based on bilinear interpolation to ensure the differentiable property of the reconstructed image. The self-supervised loss then estimates the difference between the reconstructed image and the original, training the network until convergence. Unsup_MVS sorts and filters out unreliable self-supervised signals based on the correlation of matching features between views; MVS ² adds a model for adaptively judging occlusion relationships based on the original image reprojection self-supervised signals; M ³ VSNet introduces The normal vector information assists self-supervised training and achieves a certain performance improvement. Although the current unsupervised/self-supervised 3D reconstruction technology has made many progress, it still has a certain gap with the supervised 3D reconstruction method.

To sum up, although the existing unsupervised/self-supervised 3D reconstruction methods can achieve certain results, there is still a large gap compared with the supervised 3D reconstruction methods in the same situation. This also makes unsupervised 3D reconstruction methods less reliable.

Therefore, there is a need for an un/self-supervised 3D reconstruction method that can solve the above problems.

SUMMARY OF THE INVENTION

Based on the problems existing in the prior art, the present invention provides a self-supervised three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement. The specific plans are as follows:

A self-supervised 3D reconstruction method based on collaborative segmentation and data augmentation, comprising:

Image pair acquisition: acquire input data, and acquire multi-view image pairs with overlapping areas and similar viewing angles according to the input data;

Depth estimation processing: obtain the loss of luminosity consistency by performing depth estimation processing on the multi-view image pair;

Collaborative segmentation processing: obtain semantic consistency loss by performing collaborative segmentation processing on the multi-view image pair;

Data enhancement processing: obtain data enhancement consistency loss by performing data enhancement processing on the multi-view image pair;

Constructing a loss function: constructing a loss function according to the photometric consistency loss, the semantic consistency loss, and the data augmentation consistency loss;

Model output: construct and train a neural network model according to the loss function, and obtain a three-dimensional model corresponding to the input data based on the neural network model.

In a specific embodiment, the cooperative segmentation process specifically includes:

Obtaining a collaboratively segmented image: performing collaborative segmentation on the multi-view image pair through a non-negative matrix to obtain a collaboratively segmented image;

Cross-entropy loss acquisition: obtain a reference view and a non-reference view, reconstruct the co-segmented image on the non-reference view to obtain a re-projected co-segmented image, and calculate the collaboration between the re-projected co-segmented image and the reference view Cross-entropy loss between segmented images;

Semantic consistency loss acquisition: The semantic consistency loss is obtained according to the cross-entropy loss.

In a specific embodiment, the depth estimation process specifically includes:

Perform depth estimation on the multi-view image based on a depth estimation network to obtain a depth image;

obtaining a reference viewing angle and a non-reference viewing angle, reconstructing the depth image on the non-reference viewing angle to obtain a reprojected view image, and calculating a regression loss according to the reprojected view image;

The photometric consistency loss is obtained from the regression loss.

In a specific embodiment, the data enhancement process specifically includes:

Using different data enhancement strategies to perform data enhancement on the multi-view image pair;

Using the depth image as a pseudo-label to supervise the data-enhanced multi-view image pair to obtain data loss under different data enhancement strategies;

A data enhancement consistency loss is obtained according to the data loss.

In a specific embodiment, the acquisition of the image pair specifically includes:

obtaining input data, the input data including images or videos;

Determine whether the input data is an image: if so, select a multi-view image from the input data; if not, convert the input data into a multi-view image;

obtaining a pair of multi-view images with similar viewing angles and having the same area in the multi-view images;

Image preprocessing is performed on the multi-view image pair.

In a specific embodiment, the "acquiring a pair of multi-view images with similar viewing angles and having the same area in the multi-view images" further includes:

Perform feature matching on the multi-view image by using two-dimensional scale-invariant image features to obtain the matching degree of image features;

The degree of overlapping of viewing angles between the images is calculated according to the matching degree, and the overlapping degrees of viewing angles are sorted to obtain a pair of multi-view images with similar viewing angles and having the same area.

In a specific embodiment, the cooperatively segmented image acquisition specifically includes:

Perform feature extraction on each image in the multi-view image pair by using a convolutional neural network to obtain a feature map tensor of each view, and the feature map tensors of all views form a feature map matrix;

Perform non-negative matrix decomposition on the feature map matrix through chain iteration to obtain a first non-negative matrix and a second non-negative matrix;

Convert the first non-negative matrix into a format corresponding to the image dimension to obtain a collaboratively segmented image.

In a specific embodiment, the expression of the feature map matrix is:

A∈R ^V×H×W×C

The expressions of the first non-negative matrix and the second non-negative matrix are respectively:

P∈R ^V×H×W×K , Q∈R ^C×K

The expression of the collaboratively segmented image is:

S∈R ^V×H×W×K

Among them, A is the feature map matrix, S is the co-segmented image, P is the first non-negative matrix, Q is the second non-negative matrix, V is the total number of viewing angles, H and W are the image Height and width, C is the number of channels of the convolutional layer in the convolutional neural network, K is the number of columns of the first non-negative matrix P in the non-negative matrix decomposition process, and is also the second non-negative matrix Q The number of rows, R is a real number.

In a specific embodiment, the obtaining of the cross-entropy loss specifically includes:

Selecting a reference viewing angle from all viewing angles, and viewing angles other than the reference viewing angle are non-reference viewing angles, and obtaining a collaboratively segmented image under the reference viewing angle and a collaboratively segmented image under the non-reference viewing angle;

Calculate the corresponding relationship between the pixels at the same position under the reference viewing angle and the non-reference viewing angle respectively according to the homography formula;

Based on the homography mapping formula and the bilinear interpolation strategy, the collaboratively segmented image under the non-reference viewing angle is projected to the reference viewing angle for reconstruction, and the reprojected collaboratively segmented image is obtained;

A cross-entropy loss between the reprojected co-segmented image and the co-segmented image in the reference view is calculated.

In a specific embodiment, the expressions of the co-segmented image under the reference view and the co-segmented image under the non-reference view are respectively:

S ₁ ∈? ^H×W×K , S _i ∈? ^H×W×K

Wherein, S ₁ is the collaboratively segmented image under the reference viewing angle, S _i is the collaboratively segmented image under the non-reference viewing angle, V is the total number of viewing angles, H and W are the height and width of the image, and K represents the first The number of columns of the non-negative matrix P is also the number of rows of the second non-negative matrix Q, i is the non-reference viewing angle, 2≤i≤V;

The corresponding relationship expression is:

The reprojected co-segmentation image

The expression is:

Among them, p _j is the position of the pixel under the reference viewing angle,

is the position of the pixel in the non-reference view, j represents the index value of the pixel in the image, D represents the depth map predicted by the network,

Co-segment the image for the reprojection.

In a specific embodiment, the expression of the cross entropy loss is:

f(S _1,j )=onehot(argmax(S _1,j ))

The semantic consistency error expression is:

Among them, f(S _1,j ) is the cross-entropy loss, L _SC is the semantic consistency error, M _i represents the effective area mapped from the non-reference view homography to the reference view, and N is a natural number Set, j represents the index value of the pixel in the image, H and W are the height and width of the image,

For the reprojected collaboratively segmented image, S1 is the collaboratively segmented image under the reference viewing angle, and i is _a non-reference viewing angle.

In a specific embodiment, the data augmentation strategy includes random occlusion masks, gamma correction, color perturbation, and random noise.

In a specific embodiment, the expression of the data enhancement consistency loss is:

Among them, the data _augmentation consistency loss described in LDA, the data augmentation function

is the random occlusion mask,

for the gamma correction,

for the color perturbation and random noise,

represents the random occlusion mask

The binary non-occlusion effective area mask in , D is the depth map.

A self-supervised 3D reconstruction system based on collaborative segmentation and data augmentation, comprising:

an input unit for acquiring input data, and acquiring pairs of multi-view images with overlapping regions and similar viewing angles according to the input data;

a depth processing unit, configured to obtain the loss of luminosity consistency by performing depth estimation processing on the multi-view image pair,

A dual-branch processing unit includes a collaborative segmentation unit and a data enhancement unit. The collaborative segmentation unit and the data enhancement unit run in parallel, and the collaborative segmentation unit is used to obtain semantic consistency by performing collaborative segmentation processing on the multi-view image pair. loss; the data enhancement unit is configured to obtain data enhancement consistency loss by performing data enhancement processing on the multi-view image pair;

a loss function construction unit, configured to construct a loss function according to the photometric consistency loss, the semantic consistency loss and the data enhancement consistency loss;

An output unit, configured to construct and train a neural network model according to the loss function, and obtain a three-dimensional model of the input data based on the neural network model.

In a specific embodiment, the input unit includes:

an input data acquisition unit for acquiring input data, the input data including images or videos;

A conversion unit, configured to judge whether the input data is an image: if so, select a multi-view image from the input data; if not, convert the input data into a multi-view image;

a screening unit, configured to acquire pairs of multi-view images with similar viewing angles and having the same area according to the multi-view images;

A preprocessing unit, configured to perform image preprocessing on the multi-view image pair.

In a specific embodiment, the cooperative segmentation unit includes:

a segmented image acquisition unit, configured to perform collaborative segmentation on the multi-view image pair through a non-negative matrix to obtain a collaboratively segmented image;

A cross-entropy loss acquisition unit is used to acquire a reference view and a non-reference view, reconstruct the co-segmented image on the non-reference view through homography to obtain a re-projected co-segmented image, and calculate the re-projected co-segmented image the cross-entropy loss with the co-segmented image on the reference view;

The semantic loss obtaining unit is configured to obtain the semantic consistency loss according to the cross entropy loss.

In a specific embodiment, the deep processing unit includes:

a depth image acquisition unit, configured to perform depth estimation on the multi-view image based on a depth estimation network to obtain a depth image;

a regression loss acquisition unit, configured to acquire a reference view angle and a non-reference view angle, reconstruct the depth image on the non-reference view angle through homography to obtain a reprojected view image, and calculate a regression loss according to the reprojection view image;

The photometric loss obtaining unit is configured to obtain the photometric consistency loss according to the regression loss.

In a specific embodiment, the data enhancement unit includes:

a data processing unit, configured to perform data enhancement processing on the multi-view image pair by adopting different data enhancement strategies;

a data loss obtaining unit, configured to supervise the multi-view loss image pair after the data enhancement processing by using the depth image as a pseudo-label, and obtain the data loss under different data enhancement strategies;

The data consistency loss obtaining unit is configured to obtain the data enhancement consistency loss according to the data loss.

The present invention has the following beneficial effects:

The present invention provides a self-supervised three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement. Aiming at the brightness consistency ambiguity problem, abstract semantic clues are introduced and a data enhancement mechanism is embedded in the self-supervised signal, which enhances the reliability of the self-supervised signal under noise disturbance.

The self-supervised training method proposed by the present invention surpasses traditional unsupervised methods and can achieve comparable results with some leading supervised methods.

Based on the semantic consistency loss of collaborative segmentation, the shared semantic information components are dynamically mined from multi-view pairs through clustering.

The data enhancement consistency loss extends the self-supervised branch into a dual-stream structure, uses the prediction results of the standard branch as pseudo-labels, supervises the prediction results of the data enhancement branch, and disentangles the data enhancement contrast consistency and brightness consistency assumptions and processes them separately. , enabling the introduction of a large amount of data augmentation in the self-supervised signal to augment the changes in the training set.

The whole process does not require any label data and does not rely on the true value labeling. Instead, effective information is mined from the data itself to implement network training, which greatly saves costs and shortens the reconstruction process.

Integrating depth prediction, collaborative segmentation and data enhancement together, on the basis of solving the problem of video memory overflow, the accuracy of the self-supervised signal is improved, so that this embodiment has better generalization.

In order to make the above-mentioned objects, features and advantages of the present invention more obvious and easy to understand, preferred embodiments are given below, and are described in detail as follows in conjunction with the accompanying drawings.

Description of drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the following briefly introduces the accompanying drawings used in the embodiments. It should be understood that the following drawings only show some embodiments of the present invention, and therefore do not It should be regarded as a limitation of the scope, and for those of ordinary skill in the art, other related drawings can also be obtained according to these drawings without any creative effort.

1 is a flowchart of a self-supervised three-dimensional reconstruction method according to Embodiment 1 of the present invention;

Fig. 2 is the input data processing flow chart of Embodiment 1 of the present invention;

3 is a flowchart of depth estimation processing in Embodiment 1 of the present invention;

4 is a schematic diagram of a depth estimation process according to Embodiment 1 of the present invention;

5 is a flowchart of the collaborative segmentation process in Embodiment 1 of the present invention;

6 is a schematic diagram of a collaborative segmentation process in Embodiment 1 of the present invention;

Fig. 7 is the data enhancement processing flow chart of Embodiment 1 of the present invention;

8 is a schematic diagram of a data enhancement process according to Embodiment 1 of the present invention;

Fig. 9 is the experimental detection result diagram of the embodiment of the present invention 1;

10 is a three-dimensional reconstruction result diagram of Embodiment 1 of the present invention;

11 is another three-dimensional reconstruction result diagram of Embodiment 1 of the present invention;

12 is a system module diagram of Embodiment 2 of the present invention;

FIG. 13 is a specific structural diagram of a system according to Embodiment 2 of the present invention.

Reference number:

1-input unit; 2-depth processing unit; 3-dual branch processing unit; 4-loss function construction unit; 5-output unit; 11-input data acquisition unit; 12-transformation unit; 13-screening unit; 14-pre-processing processing unit; 21-depth image acquisition unit; 22-regression loss acquisition unit; 23-luminosity loss acquisition unit; 31-cooperative segmentation unit; 311-segmented image acquisition unit; 312-cross entropy loss acquisition unit; 313-semantic loss acquisition unit; 32 - data enhancement unit; 321 - data processing unit; 322 - data loss acquisition unit; 323 - data consistency loss acquisition unit.

Detailed ways

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, but not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

Existing self-supervised 3D reconstruction algorithms often directly project images from different perspectives to the reference perspective through the predicted depth map. If the depth map is reliable enough, the reprojected reconstructed image should be as close as possible to the original image of the actual reference perspective. resemblance. In this process, by default the entire scene is subject to the Color constancy hypothesis, that is, matching points from different viewing angles have the same color. However, in a real scene, the multi-view images captured by the camera will inevitably have various interference factors, such as illumination, noise, etc., which lead to differences in the color distribution of matching points of different viewing angles. In this case, however, the Color constancy hypothesis is no longer valid, resulting in the self-supervision signal itself no longer valid. Finally, in the whole training process, unreliable self-supervised signals cannot play a good role in supervision, resulting in models trained by self-supervised methods that are inevitably different from supervised methods. This problem is called the luminance consistency ambiguity problem. If only regular training is performed, due to the ambiguity of brightness consistency, the model will be blurred in edge areas, and there will be over-smoothing problems in many areas. Only in the case of a large amount of data, or in relatively ideal scenarios, can conventional self-supervised training not be affected by the brightness consistency ambiguity problem and achieve comparable results.

The luminance consistency ambiguity problem is the core problem in unsupervised/self-supervised 3D reconstruction methods. Therefore, the limitation of un/self-supervised 3D reconstruction methods can be overcome only by solving the problem of brightness consistency ambiguity.

Aiming at the problem of brightness consistency ambiguity, the present invention proposes a self-supervised three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement. Reliability under noise disturbance can not only solve the problems of traditional 3D reconstruction methods such as loss of details, easy to receive noise and light interference, over-reliance on training data, etc., but also solve the defects of conventional un/self-supervised 3D reconstruction methods, surpassing the traditional The unsupervised/self-supervised method can achieve comparable results with some efficient supervised methods, and the whole process does not require any annotation.

Experiments show that the self-supervised 3D reconstruction method provided by the present invention surpasses the traditional unsupervised 3D reconstruction method on the DTU data set, and can achieve an effect comparable to the state-of-the-art supervised method. In addition, without any fine-tuning, the unsupervised training model finally obtained by the present invention can be directly applied to the Tanks&Temples dataset, which can also surpass the traditional unsupervised method. Since the Tanks&Temples dataset itself contains a large number of illumination changes of special natural scenes, it shows from the side that the present invention has better generalization than other unsupervised methods.

It should be noted that the present invention is as close as possible to the lighting effect in the real scene when collecting sample data, restores the noise interference and color disturbance in various scenes, and simulates various natural scenes as much as possible, so the sample has a strong effect. representative. However, the present invention can be applied to various generalized scenarios, and has stronger pertinence and wider scope of application than conventional self-supervised three-dimensional reconstruction methods.

It should be noted that the reference views in this application include the same reference views used in the depth estimation process, the collaborative segmentation process and the data enhancement process. Generally speaking, among the N multi-views, a multi-view pair must be constructed once for each viewpoint, and the multi-view pair is constructed according to which viewpoint, and which viewpoint is the reference viewpoint. Finally, there will be N multi-view pairs.

Example 1

This embodiment proposes a self-supervised three-dimensional reconstruction method based on collaborative segmentation and data enhancement, as shown in Figures 1-11 of the specification. The process steps are as shown in accompanying drawing 1 of the description, and the specific scheme is as follows:

S1, obtain input data, and obtain multi-view image pairs with overlapping regions and similar viewing angles according to the input data;

S2. Obtain the loss of photometric consistency by performing depth estimation processing on the multi-view image pair;

S3. Obtain the semantic consistency loss by performing collaborative segmentation processing on the multi-view image pair, obtain the data enhancement consistency loss by performing data enhancement processing on the multi-view image pair, and run the collaborative segmentation processing and the data enhancement processing in parallel;

S4. Construct a loss function according to the loss of photometric consistency, the loss of semantic consistency and the loss of data enhancement consistency;

S5. Construct and train a neural network model according to the loss function, and obtain a three-dimensional model of the input data based on the neural network model.

In this embodiment, step S1 acquires input data, and acquires multi-view image pairs having overlapping regions and similar viewing angles according to the input data. The process of step S1 is shown in Figure 2 of the specification, and specifically includes:

S11. Obtain input data, where the input data includes images or videos;

S12, determine whether the input data is an image: if so, select a multi-view image from the input data; if not, convert the input data into a multi-view image;

S13. Acquire pairs of multi-view images with similar viewing angles and having the same area according to the multi-view images;

S14. Perform image preprocessing on the multi-view image pair.

Specifically, the data collection of the original multi-view image can be completed by capturing images with any camera at various viewing angles or directly capturing a video while the camera is moving. The input data in this embodiment may be images or videos, or It can be an image combined with a video. If it is an image, it is only necessary to extract multi-view images from the input data, filter out multi-view image pairs with similar viewing angles and the same area from the multi-view images, and finally enhance the image quality through basic image preprocessing techniques such as image filtering. Yes; if it is a video, you need to convert the video into a multi-view image first, and filter out multi-view image pairs with similar viewing angles and the same area from the multi-view images, and then perform image preprocessing.

In particular, the step S13 of selecting a pair of multi-view images specifically includes: performing feature matching on the multi-view images by using two-dimensional scale-invariant image features, and obtaining matching information of pixels and matching degrees of image features;

The camera extrinsic parameter matrix is obtained according to the matching information, the degree of overlapping of viewing angles between the images is calculated according to the degree of matching, and the overlapping degrees of viewing angles are sorted to obtain other viewing angles that are close to each viewing angle.

Specifically, after acquiring the multi-view image pairs, feature matching is performed between all multi-view image pairs by two-dimensional scale-invariant image features such as SIFT, ORB, and SURF. Relying on the two-dimensional image pixel point matching information, the beam method adjustment problem between all cameras is solved, and the relative pose relationship between different cameras is calculated, that is, the camera extrinsic parameter matrix. In addition, the degree of coincidence of viewing angles between all pairs of image pairs is calculated according to the matching degree of the image features. Sort according to the degree of overlap, and obtain the top 10 viewing angles that are closest to the viewing angle among all remaining viewing angles for each viewing angle. Thereby, the multi-view of N viewing angles can be divided into N groups of multi-view image pairs, which are used for the subsequent stereo vision matching process.

In this embodiment, the multi-view image pair generally includes 3-7 multi-view images, and selecting multi-view image pairs with similar viewing angles and overlapping regions can facilitate subsequent feature matching. It should be noted that if the viewing angle difference is too large and the overlapping area is too small, the effective area will be very small when the subsequent process finds matching points, which will affect the progress of the process.

In this embodiment, S2 obtains the loss of luminosity consistency by performing depth estimation processing on the multi-view image pair, and the specific process is shown in FIG. 3 of the specification, including:

S21. Perform depth estimation on a multi-view image based on a depth estimation network to obtain a depth image;

S22, obtaining a reference viewing angle and a non-reference viewing angle, reconstructing the depth image on the non-reference viewing angle through homography to obtain a reprojected viewing image, and calculating a regression loss according to the reprojecting viewing image;

S23. Obtain the photometric consistency loss according to the regression loss.

Depth estimation processing is a commonly used technical means in existing 3D reconstruction methods. The specific process includes: inputting the multi-view image pair and the reference view to the depth estimation network for depth estimation, obtaining a depth map, performing homography mapping between the depth map and the multi-view image pair, and reconstructing the depth image from the non-reference view. The reprojected view image is obtained, and the regression loss, ie the L2 loss, can be obtained by calculating the difference between the reprojected view and the reference view, and the photometric consistency error is obtained based on the L2 loss. The specific principle is shown in Figure 4 of the description.

In this embodiment, S3 obtains the semantic consistency loss by performing collaborative segmentation processing on the multi-view image pairs, and obtains the data enhancement consistency loss by performing data enhancement processing on the multi-view image pairs, and the collaborative segmentation processing and the data enhancement processing are performed in parallel run. Step S3 is the core step of this embodiment. The two branches of the collaborative segmentation processing and the data enhancement processing are run in parallel to obtain the semantic consistency loss and the data enhancement consistency loss.

Among them, based on the semantic consistency loss of collaborative segmentation, the shared semantic information components are dynamically mined from multi-view pairs through clustering, without the need for ground-truth labels, and can be generalized to any scene. Supervised clustering without relying on human-defined semantic categories. However, the existing solutions based on semantic consistency often need to obtain semantic annotations through a large number of manual annotations, which is very costly; in addition, these methods are also limited to specific scenarios and specific artificially defined semantic categories, and cannot be applied to any scene. The specific process of collaborative segmentation processing is shown in Figure 5 of the description, including:

S311. Perform collaborative segmentation on the multi-view image pair by using a non-negative matrix to obtain a collaboratively segmented image;

S312: Obtain a reference perspective and a non-reference perspective, reconstruct the collaboratively segmented image on the non-reference perspective through homography to obtain a reprojected collaboratively segmented image, and calculate the difference between the reprojected collaboratively segmented image and the collaboratively segmented image on the reference perspective The cross entropy loss of ;

S313. Obtain the semantic consistency loss according to the cross-entropy loss.

The collaborative segmentation processing flow includes: inputting the reference view and multi-view image pair into the pre-trained VGG network, and then performing non-negative matrix decomposition to obtain the collaborative segmentation image in the reference perspective and the collaborative segmentation image in the non-reference perspective. The co-segmented images in the perspective are homographed to obtain the re-projected co-segmented images, and the cross-entropy loss between the re-projected co-segmented images and the co-segmented images in the reference perspective is calculated, and then the semantic consistency error is obtained. The specific process is shown in Figure 6 of the description.

In this embodiment, the cooperative segmentation process is similar to the depth estimation process of step S2. The reference view and multi-view image pairs are input to a pretrained convolutional neural network. In particular, each image in the multi-view image pair will be fed into a weighted convolutional neural network to extract features, preferably, the convolutional neural network is a VGG network pre-trained by ImageNet. Thus, the image of each view will get a corresponding feature map tensor, and the dimension of the feature map tensor is H×W×C, where H and W are the height and width of the image, and C is the convolutional neural network. The number of channels in the convolutional layer. The feature map tensors of all views are expanded and taken together to form a two-dimensional matrix, that is, the feature map matrix A∈R ^V×H×W×C , whose dimension is V×H×W×C, where V is the total view number. The feature map matrix is subjected to non-negative matrix decomposition through chain iteration to obtain a first non-negative matrix P and a second non-negative matrix Q. The expressions of the first non-negative matrix P and the second non-negative matrix Q are respectively:

P∈R ^V×H×W×K , Q∈R ^C×K

K represents the number of columns of the P matrix in the non-negative matrix decomposition process, and is also the number of rows of the Q matrix. Due to the orthogonal constraint assumption of non-negative matrices, it is required that the Q matrix in it must satisfy the following conditions: QQ ^T =I, where I is the identity matrix. Due to the constraints of orthogonality, each row vector of the Q matrix needs to contain as much information as possible of the A matrix at the same time, and keep it as disjoint as possible. In other words, each row vector of the Q matrix can be approximately regarded as the cluster center of the cluster, and the process of solving the non-negative matrix decomposition can also be regarded as the process of clustering. Correspondingly, the P matrix represents the correlation degree of each pixel of all multi-view images with respect to the semantic cluster center (the vector of each row of the Q matrix), that is, the segmentation confidence. In this way, the collaborative segmentation of multi-view images is realized without relying on any supervision signal, and the common semantic information of multi-view images is extracted. The schematic diagram of the non-negative matrix decomposition to achieve collaborative segmentation and extraction of common semantic information is shown in the accompanying drawings of the description.

Convert the first non-negative matrix to a format corresponding to the image dimension to obtain a collaboratively segmented image. The expression of the collaboratively segmented image S is:

S∈R ^V×H×W×K

Among them, V is the total number of viewing angles, H and W are the height and width of the image, and are also the number of rows of the second non-negative matrix Q, and R is a real number.

It should be noted that, in the cooperative segmentation branch, in order to take into account the amount of calculation and efficiency, only a relatively simple traditional scheme is adopted to perform the cooperative segmentation task. However, in the field of collaborative segmentation, there are actually many alternative solutions. In this embodiment, other clustering algorithms can be used to perform the collaborative segmentation task to achieve a comparable effect.

In particular, when non-negative matrix factorization is implemented, the solution often fails when dealing with multiple views of real scenes due to the flaws in the method itself. This problem is largely due to the fact that the iterative solution process is highly dependent on the random initialization state value. Once a good initial value is not encountered, the solution of the entire non-negative matrix decomposition cannot be converged, and the cooperative segmentation will also fail, and finally lead to The entire training process cannot be performed. In this embodiment, the original iterative solution process is extended to a multi-branch parallel solution process, and multiple sets of solutions are randomly initialized each time, and the optimal solution is selected and sent to the next iteration process. To a large extent, the problem of solution failure due to bad random initialization values is avoided.

Furthermore, due to the particularity of semantic segmentation tasks, it is often necessary to define specific scenarios and possible semantic categories. However, this embodiment only needs to mine common semantic components (clusters) in different views, and no longer needs to care about specific scenes and semantic labels. Therefore, the method provided in this embodiment can be generalized to any dynamically changing scene without requiring a lot of tedious and expensive semantic annotation work like other methods.

In this embodiment, S312 specifically includes: dividing the V views into view pairs consisting of a reference view and a series of non-reference views, a collaboratively segmented image S ₁ under the reference view and a collaboratively segmented image S _i under the non-reference view The expressions are:

S ₁ ∈R ^H×W×K , S _i ∈R ^H×W×K

The viewing angle with the default sequence number 1 is the reference viewing angle, and the viewing angle with the sequence number i is defined as a non-reference viewing angle, where 2≤i≤V. According to the camera intrinsic and extrinsic parameter matrix $(K,T)$, the homography formula can calculate the pixel at the position of p _j under the reference view and the position in the source view as

The corresponding relationship of the pixels:

is the position of the pixel in the non-reference view, j represents the index value of the pixel in the image or segmentation map, and 1≤j≤H×W, D represents the depth map predicted by the network.

Then, according to the homography mapping formula and the bilinear interpolation strategy, the co-segmented image Si in the non-reference view can be projected to the reference view to _reproject the co-segmented image

The expression is:

Co-segmentation of images by computational reprojection

The difference from the co-segmented image in the reference view, the cross entropy loss f(S _1,j ) can be obtained, and the expression of f(S _1,j ) is:

f(S _1,j )=onehot(argmax(S _1,j ))

The semantic consistency error is obtained according to the cross entropy loss, the semantic consistency error L _{SC, i} is expressed as:

Among them, M _i represents the effective area mapped from the reference viewing angle homography to the reference viewing angle.

The cross-entropy loss between the reconstructed semantic segmentation map and the original semantic segmentation map is calculated for all view pairs. If the predicted depth map is correct, the semantic segmentation map reconstructed from it should also be as similar as possible to the original semantic segmentation map. The calculation formula of the entire semantic consistency loss is as follows:

Among them, f(S _1,j ) is the cross-entropy loss, L _SC is the semantic consistency error, _Mi represents the effective area mapped from the non-reference view homography to the reference view, N is the set of natural numbers, and j represents The index value of the pixel in the image, H and W are the height and width of the image,

In order to reproject the co-segmented image, S ₁ is the co-segmented image under the reference viewing angle, and i is the non-reference viewing angle.

In this embodiment, the weight of semantic consistency loss during training is set to 0.1 by default.

Since the data augmentation operation itself causes the pixel values of multi-view images to change, directly applying the data augmentation strategy may break the luminance consistency assumption of self-supervised signals. Different from the ground-truth labels of supervised methods, self-supervised signals come from the data itself and are more susceptible to noise interference from the data itself. In order to introduce the data augmentation strategy into the self-supervised training framework, the original self-supervised training branch is expanded into a dual-stream structure, one standard branch is only supervised by photometric stereo vision self-supervised signals, while the other branch introduces various random data augmentation changes.

Among them, the data enhancement consistency loss extends the self-supervised branch into a dual-stream structure, uses the prediction results of the standard branch as pseudo-labels, supervises the prediction results of the data enhancement branch, and disentangles the data enhancement contrast consistency and brightness consistency assumptions, respectively. Processed to introduce a large amount of data augmentation into the self-supervised signal to augment the changes in the training set. However, existing self-supervised signals based on photometric stereo consistency are often limited by the luminance consistency assumption and do not allow data augmentation operations. Because data augmentation changes the pixel distribution of the image, the luminance consistency assumption is broken, which in turn leads to luminance consistency ambiguity, making the self-supervised signal less reliable. The specific process of data enhancement processing is shown in Figure 7 of the description, including:

S321, using different data enhancement strategies to perform data enhancement processing on the multi-view image pair;

S322, using the depth image as a pseudo-label to supervise the multi-view loss image pair after data enhancement processing, and obtain the data loss under different data enhancement strategies;

S323. Obtain the data enhancement consistency loss according to the data loss.

The specific process of data enhancement processing includes: inputting the reference view and multi-view image pairs to the depth estimation network for depth estimation processing, obtaining a depth map, obtaining an effective area mask according to the depth map, and using the effective area mask as a pseudo-label. After performing random data enhancement on the reference view and multi-view image pairs, they are input to the depth estimation network for depth estimation processing to obtain the contrast depth map, and the difference between the contrast depth map and the pseudo-label is calculated, and then the data enhancement consistency loss is obtained. The principle of data enhancement processing is shown in Figure 8 of the description.

In this embodiment, data augmentation strategies include random occlusion masks, gamma correction, color perturbation, and random noise. The original multi-view is I, and the data enhancement function acting on the multi-view image pair is τ _θ , and the multi-view after data enhancement is

θ represents the parameters related to specific operations in the data augmentation process. Limited by the viewing angle constraints of multi-view geometry, the distribution of pixel positions cannot be changed, otherwise the correspondence between the calibrated cameras may be destroyed. The data enhancements used are: random occlusion masks

Gamma Correction

Color perturbation and random noise

random occlusion mask

In order to simulate the foreground occlusion scenario in multi-view, a binary mask occlusion mask can be randomly generated

A part of the area under the reference view, and

Indicates the remaining regions that are valid in prediction. and

The included area should remain invariant to occlusion changes, so the entire system should remain invariant on such artificial occlusion edges, which can guide the model to pay more attention to the processing of occlusion edges.

Gamma Correction

Gamma correction is a common data augmentation operation used to adjust image lighting. To simulate as many and complex lighting variations as possible, random gamma correction is introduced for data augmentation.

Color perturbation and random noise

Due to the brightness consistency ambiguity problem, any color perturbation will change the pixel distribution of the image, destroying the effectiveness of the stereo vision-based self-supervised loss. It is therefore difficult for self-supervised losses to remain robust in the presence of color perturbations. By randomly perturbing the RGB pixel values of the image, and adding random Gaussian noise, etc., to assist data enhancement and simulate as many perturbation changes as possible.

It should be noted that, in the data enhancement branch of this embodiment, only three data enhancement strategies are used, and all combinations of data enhancement strategies are not enumerated and an optimal data enhancement combination is pieced together. As an alternative, some special adaptive data enhancement schemes can be used.

Among them, S322 uses the depth image as a pseudo-label to supervise the multi-view image pair after data enhancement processing, and obtains the data loss under different data enhancement strategies. The data augmentation strategy needs to ensure that there is a relatively reliable reference standard. In supervised training, this reference standard is often the invariance of random data augmentation to the true value label, but in self-supervised training, this assumption cannot be established, because the true value cannot be obtained. value label. Therefore, in this embodiment, the depth map of the branch prediction of the standard self-supervised training is used as the false true value label, that is, the depth map in the depth estimation process in step S2. Immutability. This operation decouples data augmentation from the self-supervised loss without affecting the brightness consistency assumption of the self-supervised loss.

Several data enhancement strategies in step S321 can be combined to obtain a comprehensive data enhancement function:

The depth map for standard self-supervised branch prediction (depth estimation process) is D, and the depth map for data augmentation branch prediction is

To calculate the data augmentation consistency loss, the expression of the data augmentation consistency loss _LDA is:

in,

is a random occlusion mask,

for gamma correction,

for colored detours and random noise,

represents a random occlusion mask

The binary non-occlusion valid area mask in ,

and

is the dot product, and D is the depth map predicted by the network.

During the training process, different random data augmentation strategies are applied to each image each time, and then the loss L _DA is calculated using the above formula. In addition, since the data augmentation loss requires that the overall training process has converged, if the weight of the data augmentation loss in the early stage of training is too large, it may cause the self-supervised training to fail to converge. To this end, the influence weights of the data augmentation loss are adaptively adjusted according to the training progress. The weight is 0.01 at the beginning, and then doubles every two epochs. The data augmentation loss only plays a substantial role after the network has converged.

In particular, since the entire self-supervised training framework uses many operations, especially the data augmentation branch runs the entire network forward twice. Generally speaking, if the parallel forward-reverse update strategy is directly used, the GPU memory during the training process is not enough (11G by default), and there is an existing overflow problem. In this embodiment, a strategy of exchanging time for space is adopted for the problem of video memory overflow. The original one-step forward-computing self-supervised loss-backpropagation process is divided into two sets of forward and backpropagation processes. The self-supervised loss of the standard branch is calculated in the forward direction, then the gradient is updated by back-propagation, the cache is cleared, and the depth map result predicted by the standard branch is saved as a pseudo-label; then the self-supervised loss of the data-enhanced branch is calculated in the forward direction. its training. Since the gradient updates of multiple losses are decoupled into different stages, there is no need to occupy the video memory at the same time, which greatly reduces the video memory occupation of the GPU.

In this embodiment, S4 constructs a loss function according to the loss of photometric consistency, the loss of semantic consistency, and the loss of data enhancement consistency. The expression of the loss function L is:

L=L _PC +L _DA +L _SC

Among them, L _PC is the photometric consistency loss, L _D□ is the data enhancement consistency loss, and L _SC is the semantic consistency loss.

In this embodiment, the traditional stereo matching is replaced by the dense depth map estimation based on deep learning, a neural network model is constructed and trained according to the loss function, and the neural network model is applied to the complete three-dimensional reconstruction to obtain a three-dimensional model. It can be comparable to the method of manually annotating samples. This embodiment provides an alternative solution for training a high-precision 3D reconstruction model at a low cost, which can be extended to scenarios related to 3D reconstruction, such as map exploration, automatic driving, and AR/VR.

The method proposed in this embodiment is detected based on the DTU data set, and the experimental detection result is shown in FIG. 9 in the description. Among them, the DACS-MS proposed in the embodiment of the present invention has an average reconstruction error of 0.358 mm for each point on the DTU data set, which is much smaller than similar unsupervised methods such as MVS, MVS ² , and M ³ VSNet. Compared with supervised methods, DACS-MS is also close to the state-of-the-art supervised methods and surpasses some existing supervised methods. Experimental results show that the self-supervised method proposed in this implementation outperforms traditional unsupervised 3D reconstruction methods on the DTU dataset and can achieve comparable results to state-of-the-art supervised methods. The effect of the model reconstructed by using the self-supervised three-dimensional reconstruction method based on collaborative segmentation and data enhancement provided in this embodiment is shown in FIG. 10 and FIG. 11 in the description. The experimental results of this embodiment are shown in the third column of the accompanying drawings. From the specific experimental results, this embodiment can fully achieve the same or similar technical effects as the supervised method, and the reconstructed 3D model meets the technical requirements.

This embodiment provides a self-supervised 3D reconstruction method based on collaborative segmentation and data enhancement. For the brightness consistency ambiguity problem, abstract semantic clues are introduced and a data enhancement mechanism is embedded in the self-supervised signal, which enhances the self-supervised signal under noise disturbance. reliability. The self-supervised training method proposed in this embodiment surpasses traditional unsupervised methods and can achieve comparable results with some leading supervised methods. Based on the semantic consistency loss of collaborative segmentation, the shared semantic information components are dynamically mined from multi-view pairs through clustering. The data enhancement consistency loss extends the self-supervised branch into a dual-stream structure, uses the prediction results of the standard branch as pseudo-labels, supervises the prediction results of the data enhancement branch, and disentangles the data enhancement contrast consistency and brightness consistency assumptions and processes them separately. , enabling the introduction of a large amount of data augmentation in the self-supervised signal to augment the changes in the training set. The whole process does not require any label data and does not rely on the true value labeling. Instead, effective information is mined from the data itself to implement network training, which greatly saves costs and shortens the reconstruction process. Integrating depth prediction, collaborative segmentation and data enhancement together, on the basis of solving the problem of video memory overflow, the accuracy of the self-supervised signal is improved, so that this embodiment has better generalization.

Example 2

On the basis of Embodiment 1, this embodiment modularizes a self-supervised 3D reconstruction method based on collaborative segmentation and data enhancement proposed in Embodiment 1 to form a self-supervised 3D reconstruction system based on collaborative segmentation and data enhancement, A schematic diagram of each module is shown in FIG. 12 in the specification, and a complete system structure diagram is shown in FIG. 13 in the specification.

A self-supervised three-dimensional reconstruction system based on collaborative segmentation and data enhancement includes an input unit 1, a depth processing unit 2, a dual-branch processing unit 3, a loss function construction unit 4 and an output unit 5 that are connected in sequence.

The input unit 1 is used for acquiring input data, and acquiring multi-view image pairs having overlapping regions and similar viewing angles according to the input data. The input unit includes an input data acquisition unit 11 , a conversion unit 12 , a screening unit 13 and a preprocessing unit 14 .

The depth processing unit 2 is configured to obtain the loss of luminosity consistency by performing depth estimation processing on the multi-view image pair. The depth processing unit includes a depth image acquisition unit 21 , a regression loss acquisition unit 22 and a luminosity loss acquisition unit 23 .

The dual-branch processing unit 3 includes a collaborative segmentation unit 31 and a data enhancement unit 32. The collaborative segmentation unit 31 and the data enhancement unit 32 run in parallel. The collaborative segmentation unit 31 is used to obtain semantic consistency by performing collaborative segmentation processing on multi-view image pairs. loss. The data enhancement unit 32 is configured to obtain the data enhancement consistency loss by performing data enhancement processing on the multi-view image pair.

The loss function construction unit 4 is used for constructing a loss function according to the photometric consistency loss, the semantic consistency loss and the data enhancement consistency loss.

The output unit 5 is used for constructing and training a neural network model according to the loss function, and obtaining a three-dimensional model of the input data based on the neural network model.

The depth processing unit 2 includes a depth image acquisition unit 21 , a regression loss acquisition unit 22 and a luminance loss acquisition unit 23 . The basic principles include: inputting multi-view image pairs and reference views into the depth estimation network for depth estimation, obtaining a depth map, performing homography mapping on the depth map and multi-view image pairs, and reconstructing the depth images from non-reference views. The reprojected view image is obtained, the regression loss can be obtained by calculating the difference between the reprojected view and the reference view, and the photometric consistency error is obtained based on the regression loss. The specific structure includes:

The depth image acquisition unit 21 is configured to perform depth estimation on a multi-view image through a depth estimation network to acquire a depth image.

The regression loss obtaining unit 22 is configured to obtain a reference view angle and a non-reference view angle, reconstruct the depth image on the non-reference view angle through homography to obtain a reprojected view image, and calculate the regression loss according to the reprojection view image.

The photometric loss obtaining unit 23 is configured to obtain the photometric consistency loss according to the regression loss.

The collaborative segmentation unit 31 includes a segmented image acquisition unit 311 , a cross-entropy loss acquisition unit 312 and a semantic loss acquisition unit 313 . The basic principle of the collaborative segmentation unit 31 includes: inputting a reference view and a multi-view image pair into a pre-trained VGG network, and then performing non-negative matrix decomposition to obtain the collaborative segmentation image under the reference view and the collaborative segmentation image under the non-reference view, Perform homography projection on the co-segmented image in the non-reference view to obtain the re-projected co-segmented image, calculate the cross-entropy loss between the re-projected co-segmented image and the co-segmented image under the reference view, and then obtain the semantic consistency error. The specific structure includes:

The segmented image obtaining unit 311 is configured to perform cooperative segmentation on the multi-view image pair by using a non-negative matrix to obtain a cooperatively segmented image.

The cross-entropy loss obtaining unit 312 is configured to obtain a reference view and a non-reference view, reconstruct the co-segmented image on the non-reference view through homography to obtain a re-projected co-segmented image, and calculate the re-projected co-segmented image and the reference view Cross-entropy loss between co-segmented images on .

The semantic loss obtaining unit 313 is configured to obtain the semantic consistency loss according to the cross entropy loss.

The data enhancement unit 32 includes a data processing unit 321 , a data loss obtaining unit 322 and a data consistency loss obtaining unit 323 . The basic principle of the data enhancement unit 32 includes: inputting the reference view and the multi-view image pair to the depth processing unit to perform depth estimation processing, obtaining a depth map, obtaining an effective area mask according to the depth map, and using the effective area mask as a pseudo-label. After random data enhancement is performed on the reference view and multi-view image pair, the depth estimation process is performed with the depth estimation network to obtain the contrast depth map, and the difference between the contrast depth map and the pseudo-label is calculated, and then the consistency loss of data enhancement is obtained. The specific structure includes:

The data processing unit 321 is configured to perform data enhancement processing on the multi-view image pair by adopting different data enhancement strategies. The data processing unit is provided with a depth estimation network.

The data loss obtaining unit 322 is configured to supervise the multi-view loss image pair after data enhancement processing with the depth image as a pseudo-label, and obtain the data loss under different data enhancement strategies.

The data consistency loss obtaining unit 323 is configured to obtain the data enhancement consistency loss according to the data loss.

The input unit 1 includes an input data acquisition unit 11 , a conversion unit 12 , a screening unit 13 and a preprocessing unit 14 . The specific structure includes:

The input data acquisition unit 11 is used for acquiring input data, and the input data includes images or videos.

The conversion unit 12 is configured to determine whether the input data is an image: if so, select a multi-view image from the input data. If not, convert the input data to a multi-view image.

The screening unit 13 is configured to acquire, according to the multi-view images, pairs of multi-view images with similar viewing angles and having the same area.

The preprocessing unit 14 is configured to perform image preprocessing on the multi-view image pair.

On the basis of Embodiment 1, this embodiment proposes a deep learning-based sample image generation system, and modularizes the method of Embodiment 1 to form a specific system, making it more practical.

Aiming at the prior art, the present invention provides a self-supervised three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement. Aiming at the brightness consistency ambiguity problem, abstract semantic clues are introduced and a data enhancement mechanism is embedded in the self-supervised signal, which enhances the reliability of the self-supervised signal under noise disturbance. The self-supervised training method proposed in the present invention surpasses traditional unsupervised methods and can achieve comparable results with some leading supervised methods. Based on the semantic consistency loss of collaborative segmentation, the shared semantic information components are dynamically mined from multi-view pairs through clustering. The data enhancement consistency loss extends the self-supervised branch into a dual-stream structure, uses the prediction results of the standard branch as pseudo-labels, supervises the prediction results of the data enhancement branch, and disentangles the data enhancement contrast consistency and brightness consistency assumptions and processes them separately. , enabling the introduction of a large amount of data augmentation in the self-supervised signal to augment the changes in the training set. The whole process does not require any label data and does not rely on the true value labeling. Instead, effective information is mined from the data itself to implement network training, which greatly saves costs and shortens the reconstruction process. Integrating depth prediction, collaborative segmentation and data enhancement together, on the basis of solving the problem of video memory overflow, the accuracy of the self-supervised signal is improved, so that this embodiment has better generalization. The method is modularized to form a specific system, making it more practical.

Those of ordinary skill in the art should understand that the above-mentioned modules or steps of the present invention can be implemented by a general-purpose computing device, and they can be centralized on a single computing device, or distributed on a network composed of multiple computing devices. Optionally, they may be implemented in program code executable by a computer device, so that they can be stored in a storage device and executed by the computing device, or they can be fabricated separately into individual integrated circuit modules, or a plurality of modules of them Or the steps are made into a single integrated circuit module to realize. As such, the present invention is not limited to any specific combination of hardware and software.

Note that the above are only preferred embodiments of the present invention and applied technical principles. Those skilled in the art will understand that the present invention is not limited to the specific embodiments herein, and various obvious changes, readjustments and substitutions can be made by those skilled in the art without departing from the protection scope of the present invention. Therefore, although the present invention has been described in detail through the above embodiments, the present invention is not limited to the above embodiments, and can also include more other equivalent embodiments without departing from the concept of the present invention. The scope is determined by the scope of the appended claims.

The above disclosures are only a few specific implementation scenarios of the present invention, however, the present invention is not limited thereto, and any changes that can be conceived by those skilled in the art should fall within the protection scope of the present invention.

Claims

A self-supervised three-dimensional reconstruction method based on collaborative segmentation and data enhancement, characterized in that it includes:

Image pair acquisition: acquire input data, and acquire multi-view image pairs with overlapping areas and similar viewing angles according to the input data;

Depth estimation processing: obtain the loss of luminosity consistency by performing depth estimation processing on the multi-view image pair;

Collaborative segmentation processing: obtain semantic consistency loss by performing collaborative segmentation processing on the multi-view image pair;

Data enhancement processing: obtain data enhancement consistency loss by performing data enhancement processing on the multi-view image pair;

Constructing a loss function: constructing a loss function according to the photometric consistency loss, the semantic consistency loss, and the data augmentation consistency loss;

Model output: construct and train a neural network model according to the loss function, and obtain a three-dimensional model corresponding to the input data based on the neural network model.
The method according to claim 1, wherein the cooperative segmentation process specifically comprises:

Obtaining a collaboratively segmented image: performing collaborative segmentation on the multi-view image pair through a non-negative matrix to obtain a collaboratively segmented image;

Cross-entropy loss acquisition: obtain a reference view and a non-reference view, reconstruct the co-segmented image on the non-reference view to obtain a re-projected co-segmented image, and calculate the collaboration between the re-projected co-segmented image and the reference view Cross-entropy loss between segmented images;

Semantic consistency loss acquisition: The semantic consistency loss is obtained according to the cross-entropy loss.
The method according to claim 1 or 2, wherein the depth estimation process specifically comprises:

Perform depth estimation on the multi-view image based on a depth estimation network to obtain a depth image;

obtaining a reference viewing angle and a non-reference viewing angle, reconstructing the depth image on the non-reference viewing angle to obtain a reprojected view image, and calculating a regression loss according to the reprojected view image;

The photometric consistency loss is obtained from the regression loss.
The method according to claim 3, wherein the data enhancement processing specifically comprises:

Using different data enhancement strategies to perform data enhancement on the multi-view image pair;

Using the depth image as a pseudo-label to supervise the data-enhanced multi-view image pair to obtain data loss under different data enhancement strategies;

A data enhancement consistency loss is obtained according to the data loss.
The method according to claim 1, wherein the acquiring of the image pair specifically comprises:

obtaining input data, the input data including images or videos;

Determine whether the input data is an image: if so, select a multi-view image from the input data; if not, convert the input data into a multi-view image;

obtaining a pair of multi-view images with similar viewing angles and having the same area in the multi-view images;

Image preprocessing is performed on the multi-view image pair.
The method according to claim 5, wherein the "obtaining a pair of multi-view images with similar viewing angles and having the same area in the multi-view images" further comprises:

Perform feature matching on the multi-view image by using two-dimensional scale-invariant image features to obtain the matching degree of image features;

The degree of overlapping of viewing angles between the images is calculated according to the matching degree, and the degrees of overlapping of viewing angles are sorted to obtain a pair of multi-view images with similar viewing angles and having the same area.
The method according to claim 2, wherein the cooperatively segmented image acquisition specifically comprises:

Perform feature extraction on each image in the multi-view image pair by using a convolutional neural network to obtain a feature map tensor of each view, and the feature map tensors of all views form a feature map matrix;

Perform non-negative matrix decomposition on the feature map matrix through chain iteration to obtain a first non-negative matrix and a second non-negative matrix;

Convert the first non-negative matrix into a format corresponding to the image dimension to obtain a collaboratively segmented image.
The method according to claim 7, wherein the expression of the feature map matrix is:

A∈R V×H×W×C

The expressions of the first non-negative matrix and the second non-negative matrix are respectively:

P∈R V×H×W×K , Q∈R C×K

The expression of the co-segmented image is:

S∈R V×H×W×K

Among them, A is the feature map matrix, S is the co-segmented image, P is the first non-negative matrix, Q is the second non-negative matrix, V is the total number of viewing angles, H and W are the image Height and width, C is the number of channels of the convolutional layer in the convolutional neural network, K is the number of columns of the first non-negative matrix P in the non-negative matrix decomposition process, and is also the second non-negative matrix Q The number of rows, R is a real number.
The method according to claim 2 or 7, wherein the obtaining of the cross-entropy loss specifically comprises:

Selecting a reference viewing angle from all viewing angles, and viewing angles other than the reference viewing angle are non-reference viewing angles, and obtaining a collaboratively segmented image under the reference viewing angle and a collaboratively segmented image under the non-reference viewing angle;

Calculate the corresponding relationship between the pixels at the same position under the reference viewing angle and the non-reference viewing angle respectively according to the homography formula;

Based on the homography mapping formula and the bilinear interpolation strategy, the collaboratively segmented image under the non-reference viewing angle is projected to the reference viewing angle for reconstruction, and the reprojected collaboratively segmented image is obtained;

A cross-entropy loss between the reprojected co-segmented image and the co-segmented image in the reference view is calculated.
The method according to claim 9, wherein the expressions of the collaboratively segmented image under the reference viewing angle and the collaboratively segmented image under the non-reference viewing angle are respectively:

S 1 ∈R H×W×K , S i ∈R H×W×K

Wherein, S 1 is the collaboratively segmented image under the reference viewing angle, S i is the collaboratively segmented image under the non-reference viewing angle, V is the total number of viewing angles, H and W are the height and width of the image, and K represents the first The number of columns of the non-negative matrix P is also the number of rows of the second non-negative matrix Q, i is the non-reference viewing angle, 2≤i≤V;

The corresponding relationship expression is:

The reprojected co-segmentation image
The expression is:

Among them, p j is the position of the pixel under the reference viewing angle,
is the position of the pixel in the non-reference view, j represents the index value of the pixel in the image, D represents the depth map predicted by the network,
Co-segment the image for the reprojection.
The method according to claim 10, wherein the expression of the cross entropy loss is:

f(S 1,j )=onehot(argmax(S 1,j ))

The semantic consistency error expression is:

Among them, f(S 1,j ) is the cross-entropy loss, L SC is the semantic consistency error, M i represents the effective area mapped from the non-reference view homography to the reference view, and N is a natural number Set, i is the non-reference viewing angle, j is the index value of the pixel in the image, H and W are the height and width of the image,
For the reprojected collaboratively segmented image, S1 is the collaboratively segmented image under the reference viewing angle.
The method of claim 4, wherein the data augmentation strategy includes random occlusion masks, gamma correction, color perturbation, and random noise.
The method according to claim 12, wherein the expression of the data enhancement consistency loss is:

Among them, L DA is the data enhancement consistency loss, and the data enhancement function

is the random occlusion mask,
for the gamma correction,
for the color perturbation and random noise,
represents the random occlusion mask
The binary non-occlusion effective area mask in , D is the depth map.
A self-supervised three-dimensional reconstruction system based on collaborative segmentation and data enhancement, characterized in that it includes:

an input unit for acquiring input data, and acquiring pairs of multi-view images with overlapping regions and similar viewing angles according to the input data;

a depth processing unit, configured to obtain the loss of luminosity consistency by performing depth estimation processing on the multi-view image pair,

A dual-branch processing unit includes a collaborative segmentation unit and a data enhancement unit. The collaborative segmentation unit and the data enhancement unit run in parallel, and the collaborative segmentation unit is used to obtain semantic consistency by performing collaborative segmentation processing on the multi-view image pair. loss; the data enhancement unit is configured to obtain data enhancement consistency loss by performing data enhancement processing on the multi-view image pair;

a loss function construction unit, configured to construct a loss function according to the photometric consistency loss, the semantic consistency loss and the data enhancement consistency loss;

An output unit, configured to construct and train a neural network model according to the loss function, and obtain a three-dimensional model of the input data based on the neural network model.
The system of claim 14, wherein the input unit comprises:

an input data acquisition unit for acquiring input data, the input data including images or videos;

A conversion unit, configured to judge whether the input data is an image: if so, select a multi-view image from the input data; if not, convert the input data into a multi-view image;

a screening unit, configured to acquire pairs of multi-view images with similar viewing angles and having the same area according to the multi-view images;

A preprocessing unit, configured to perform image preprocessing on the multi-view image pair.
The system according to claim 14 or 15, wherein the cooperative segmentation unit comprises:

a segmented image acquisition unit, configured to perform collaborative segmentation on the multi-view image pair through a non-negative matrix to obtain a collaboratively segmented image;

A cross-entropy loss acquisition unit is used to acquire a reference view and a non-reference view, reconstruct the co-segmented image on the non-reference view through homography to obtain a re-projected co-segmented image, and calculate the re-projected co-segmented image the cross-entropy loss with the co-segmented image on the reference view;

The semantic loss obtaining unit is configured to obtain the semantic consistency loss according to the cross entropy loss.
The system of claim 16, wherein the deep processing unit comprises:

a depth image acquisition unit, configured to perform depth estimation on the multi-view image based on a depth estimation network to obtain a depth image;

a regression loss acquisition unit, configured to acquire a reference view angle and a non-reference view angle, reconstruct the depth image on the non-reference view angle through homography to obtain a reprojected view image, and calculate a regression loss according to the reprojection view image;

The photometric loss obtaining unit is configured to obtain the photometric consistency loss according to the regression loss.
The system of claim 17, wherein the data enhancement unit comprises:

a data processing unit, configured to perform data enhancement processing on the multi-view image pair by adopting different data enhancement strategies;

A data loss acquisition unit, configured to supervise the multi-view loss image pair processed by the data processing unit by using the depth image as a pseudo-label to acquire data loss under different data enhancement strategies;

The data consistency loss obtaining unit is configured to obtain the data enhancement consistency loss according to the data loss.