WO2022166412A1 - Self-supervised three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement - Google Patents

Self-supervised three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement

Info

Publication number
WO2022166412A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
loss
view
data
data enhancement
Application number
PCT/CN2021/137980
Other languages
French (fr)
Chinese (zh)
Inventor
许鸿斌
周志鹏
乔宇
康文雄
吴秋霞
Original Assignee
中国科学院深圳先进技术研究院
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2022166412A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T7/00 - Image analysis
    • G06T7/50 - Depth or shape recovery
    • G06T7/55 - Depth or shape recovery from multiple images
    • G06T7/593 - Depth or shape recovery from multiple images from stereo images
    • G06T7/596 - Depth or shape recovery from multiple images from stereo images from three or more stereo images
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Definitions

  • the present invention relates to the field of image processing, in particular, to a self-supervised three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement.
  • the 3D reconstruction method based on multi-view stereo (MVS) aims to recover the 3D structure of a scene from natural images captured from multiple given viewpoints together with the corresponding camera positions.
  • Although traditional 3D reconstruction methods can effectively reconstruct 3D models in general scenes, due to the limitations of traditional measurement methods they can often only reconstruct a relatively sparse point cloud, losing a considerable amount of detail. In addition, they are easily disturbed by factors such as noise and lighting.
  • These self-supervised methods recast the depth estimation problem in the 3D reconstruction pipeline as an image reconstruction problem in order to design self-supervised signals.
  • The depth map predicted by the network and the multi-view images are projected to the same view through homography mapping, and pixel values are computed by bilinear interpolation to keep the reconstructed image differentiable.
  • The self-supervised loss then measures the difference between the reconstructed image and the original image, and the network is trained until convergence.
  • Unsup_MVS ranks and filters out unreliable self-supervised signals based on the correlation of matched features between views; MVS² adds a model that adaptively judges occlusion relationships on top of the original image-reprojection self-supervised signal; M³VSNet introduces surface normal information to assist self-supervised training and achieves a certain performance improvement.
  • Although current unsupervised/self-supervised 3D reconstruction techniques have made considerable progress, a gap to supervised 3D reconstruction methods remains.
  • the present invention provides a self-supervised three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement.
  • the specific scheme is as follows:
  • A self-supervised 3D reconstruction method based on collaborative segmentation and data augmentation, comprising:
  • Image pair acquisition: acquire input data, and obtain multi-view image pairs with overlapping regions and similar viewing angles from the input data;
  • Depth estimation processing: obtain a photometric consistency loss by performing depth estimation processing on the multi-view image pair;
  • Collaborative segmentation processing: obtain a semantic consistency loss by performing collaborative segmentation processing on the multi-view image pair;
  • Data enhancement processing: obtain a data enhancement consistency loss by performing data enhancement processing on the multi-view image pair;
  • Loss function construction: construct a loss function according to the photometric consistency loss, the semantic consistency loss, and the data enhancement consistency loss;
  • Model output: construct and train a neural network model according to the loss function, and obtain a three-dimensional model corresponding to the input data based on the neural network model.
  • the collaborative segmentation processing specifically includes:
  • Co-segmented image acquisition: perform collaborative segmentation on the multi-view image pair through a non-negative matrix to obtain co-segmented images;
  • Cross-entropy loss acquisition: obtain a reference view and non-reference views, reconstruct the co-segmented image of a non-reference view to obtain a reprojected co-segmented image, and calculate the cross-entropy loss between the reprojected co-segmented image and the co-segmented image of the reference view;
  • Semantic consistency loss acquisition: obtain the semantic consistency loss according to the cross-entropy loss.
  • the depth estimation process specifically includes:
  • the photometric consistency loss is obtained from the regression loss.
  • the data enhancement process specifically includes:
  • a data enhancement consistency loss is obtained according to the data loss.
  • the acquisition of the image pair specifically includes:
  • the input data including images or videos
  • Image preprocessing is performed on the multi-view image pair.
  • the "acquiring a pair of multi-view images with similar viewing angles and having the same area in the multi-view images” further includes:
  • the degree of overlapping of viewing angles between the images is calculated according to the matching degree, and the overlapping degrees of viewing angles are sorted to obtain a pair of multi-view images with similar viewing angles and having the same area.
  • the cooperatively segmented image acquisition specifically includes:
  • the feature map matrix is A ∈ R^((V·H·W)×C); the first non-negative matrix and the second non-negative matrix are obtained from the factorization A ≈ PQ, with P ∈ R^((V·H·W)×K) and Q ∈ R^(K×C), where:
  • A is the feature map matrix
  • S is the co-segmented image
  • P is the first non-negative matrix
  • Q is the second non-negative matrix
  • V is the total number of viewing angles
  • H and W are the image height and width
  • C is the number of channels of the convolutional layer in the convolutional neural network
  • K is the number of columns of the first non-negative matrix P in the non-negative matrix factorization
  • R denotes the set of real numbers.
  • the obtaining of the cross-entropy loss specifically includes:
  • the collaboratively segmented image under the non-reference viewing angle is projected to the reference viewing angle for reconstruction, and the reprojected collaboratively segmented image is obtained;
  • a cross-entropy loss between the reprojected co-segmented image and the co-segmented image in the reference view is calculated.
  • the co-segmented image under the reference view and the co-segmented images under the non-reference views are S_1 ∈ R^(H×W×K) and S_i ∈ R^(H×W×K) respectively, where:
  • S_1 is the co-segmented image under the reference view
  • S_i is the co-segmented image under the i-th non-reference view
  • V is the total number of viewing angles
  • H and W are the height and width of the image
  • K is the number of columns of the first non-negative matrix P and also the number of rows of the second non-negative matrix Q
  • i indexes the non-reference views, 2 ≤ i ≤ V;
  • p_j is the position of a pixel under the reference view and p̂_(i,j) is the position of the corresponding pixel in the non-reference view, j is the index value of the pixel in the image, D is the depth map predicted by the network, and Ŝ_(i→1) is the reprojected co-segmented image.
  • the cross-entropy loss is accumulated over all non-reference views i and over all pixels j inside the valid area M_i, giving the semantic consistency error L_SC as the sum of the per-pixel cross-entropy terms f(S_(1,j)), where:
  • f(S_(1,j)) is the per-pixel cross-entropy loss
  • L_SC is the semantic consistency error
  • M_i represents the effective area mapped by homography from the non-reference view to the reference view
  • N is the set of natural numbers
  • j represents the index value of the pixel in the image
  • H and W are the height and width of the image
  • S_1 is the co-segmented image under the reference view
  • i is a non-reference view.
  • the data augmentation strategy includes random occlusion masks, gamma correction, color perturbation, and random noise.
  • the data enhancement consistency loss L_DA measures, within the binary non-occluded valid-area mask induced by the random occlusion mask, the difference between the depth map predicted from the augmented views and the pseudo-label depth map D, where the data augmentation function is composed of the random occlusion mask, the gamma correction, the color perturbation and the random noise, and D is the depth map.
  • a self-supervised 3D reconstruction system based on collaborative segmentation and data augmentation comprising:
  • an input unit for acquiring input data, and acquiring pairs of multi-view images with overlapping regions and similar viewing angles according to the input data
  • a depth processing unit configured to obtain the photometric consistency loss by performing depth estimation processing on the multi-view image pair
  • a dual-branch processing unit, which includes a collaborative segmentation unit and a data enhancement unit that run in parallel; the collaborative segmentation unit is used to obtain the semantic consistency loss by performing collaborative segmentation processing on the multi-view image pair, and the data enhancement unit is configured to obtain the data enhancement consistency loss by performing data enhancement processing on the multi-view image pair;
  • a loss function construction unit configured to construct a loss function according to the photometric consistency loss, the semantic consistency loss and the data enhancement consistency loss
  • An output unit configured to construct and train a neural network model according to the loss function, and obtain a three-dimensional model of the input data based on the neural network model.
  • the input unit includes:
  • an input data acquisition unit for acquiring input data, the input data including images or videos;
  • a conversion unit configured to judge whether the input data is an image: if so, select a multi-view image from the input data; if not, convert the input data into a multi-view image;
  • a screening unit configured to acquire pairs of multi-view images with similar viewing angles and having the same area according to the multi-view images
  • a preprocessing unit configured to perform image preprocessing on the multi-view image pair.
  • the cooperative segmentation unit includes:
  • a segmented image acquisition unit configured to perform collaborative segmentation on the multi-view image pair through a non-negative matrix to obtain a collaboratively segmented image
  • a cross-entropy loss acquisition unit is used to acquire a reference view and a non-reference view, reconstruct the co-segmented image on the non-reference view through homography to obtain a re-projected co-segmented image, and calculate the re-projected co-segmented image the cross-entropy loss with the co-segmented image on the reference view;
  • the semantic loss obtaining unit is configured to obtain the semantic consistency loss according to the cross entropy loss.
  • the deep processing unit includes:
  • a depth image acquisition unit configured to perform depth estimation on the multi-view image based on a depth estimation network to obtain a depth image
  • a regression loss acquisition unit configured to acquire a reference view angle and a non-reference view angle, reconstruct the depth image on the non-reference view angle through homography to obtain a reprojected view image, and calculate a regression loss according to the reprojection view image;
  • the photometric loss obtaining unit is configured to obtain the photometric consistency loss according to the regression loss.
  • the data enhancement unit includes:
  • a data processing unit configured to perform data enhancement processing on the multi-view image pair by adopting different data enhancement strategies
  • a data loss obtaining unit configured to supervise the multi-view image pair after the data enhancement processing by using the depth image as a pseudo-label, and obtain the data loss under different data enhancement strategies
  • the data consistency loss obtaining unit is configured to obtain the data enhancement consistency loss according to the data loss.
  • the present invention provides a self-supervised three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement. Aiming at the brightness consistency ambiguity problem, abstract semantic clues are introduced and a data enhancement mechanism is embedded in the self-supervised signal, which enhances the reliability of the self-supervised signal under noise disturbance.
  • the self-supervised training method proposed by the present invention surpasses traditional unsupervised methods and can achieve comparable results with some leading supervised methods.
  • the shared semantic information components are dynamically mined from multi-view pairs through clustering.
  • the data enhancement consistency loss extends the self-supervised branch into a dual-stream structure, uses the prediction results of the standard branch as pseudo-labels to supervise the prediction results of the data enhancement branch, and decouples the data augmentation consistency from the brightness consistency assumption so that the two are processed separately, which makes it possible to introduce a large amount of data augmentation into the self-supervised signal to enrich the variation of the training set.
  • FIG. 1 is a flowchart of the self-supervised three-dimensional reconstruction method according to Embodiment 1 of the present invention.
  • FIG. 2 is a flowchart of input data processing in Embodiment 1 of the present invention.
  • FIG. 3 is a flowchart of depth estimation processing in Embodiment 1 of the present invention.
  • FIG. 4 is a schematic diagram of the depth estimation process according to Embodiment 1 of the present invention.
  • FIG. 6 is a schematic diagram of the collaborative segmentation process in Embodiment 1 of the present invention.
  • FIG. 7 is a flowchart of data enhancement processing in Embodiment 1 of the present invention.
  • FIG. 8 is a schematic diagram of the data enhancement process according to Embodiment 1 of the present invention.
  • FIG. 9 shows the experimental detection results of Embodiment 1 of the present invention.
  • FIG. 10 is a three-dimensional reconstruction result diagram of Embodiment 1 of the present invention.
  • FIG. 11 is another three-dimensional reconstruction result diagram of Embodiment 1 of the present invention.
  • FIG. 12 is a system module diagram of Embodiment 2 of the present invention.
  • FIG. 13 is a specific structural diagram of the system according to Embodiment 2 of the present invention.
  • the brightness consistency ambiguity problem is the core problem in unsupervised/self-supervised 3D reconstruction methods; the limitations of un/self-supervised 3D reconstruction methods can therefore be overcome only by solving the brightness consistency ambiguity problem.
  • the present invention proposes a self-supervised three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement.
  • Enhancing the reliability of the self-supervised signal under noise disturbance not only addresses the problems of traditional 3D reconstruction methods, such as loss of detail, susceptibility to noise and lighting interference, and over-reliance on training data, but also overcomes the defects of conventional un/self-supervised 3D reconstruction methods: the proposed approach surpasses traditional unsupervised/self-supervised methods, achieves results comparable to some leading supervised methods, and requires no annotation throughout the whole process.
  • the self-supervised 3D reconstruction method provided by the present invention surpasses the traditional unsupervised 3D reconstruction method on the DTU data set, and can achieve an effect comparable to the state-of-the-art supervised method.
  • the unsupervised training model finally obtained by the present invention can be directly applied to the Tanks&Temples dataset, where it also surpasses traditional unsupervised methods. Since the Tanks&Temples dataset itself contains a large number of illumination changes of special natural scenes, this indirectly shows that the present invention generalizes better than other unsupervised methods.
  • When collecting sample data, the present invention keeps the lighting as close as possible to that of real scenes, reproduces the noise interference and color disturbance of various scenes, and simulates natural scenes as faithfully as possible, so the samples are highly representative.
  • the present invention can be applied to various generalized scenarios, and has stronger pertinence and wider scope of application than conventional self-supervised three-dimensional reconstruction methods.
  • the reference views in this application include the same reference views used in the depth estimation process, the collaborative segmentation process and the data enhancement process.
  • a multi-view pair is constructed once for each viewpoint, and whichever viewpoint a multi-view pair is constructed around is the reference viewpoint of that pair.
  • For N viewpoints there will be N multi-view pairs.
  • This embodiment proposes a self-supervised three-dimensional reconstruction method based on collaborative segmentation and data enhancement, as shown in Figures 1-11 of the specification.
  • the process steps are as shown in FIG. 1 of the description, and the specific scheme is as follows:
  • S1. Obtain input data, and obtain multi-view image pairs with overlapping regions and similar viewing angles according to the input data;
  • S5. Construct and train a neural network model according to the loss function, and obtain a three-dimensional model of the input data based on the neural network model.
  • step S1 acquires input data, and acquires multi-view image pairs having overlapping regions and similar viewing angles according to the input data.
  • the process of step S1 is shown in Figure 2 of the specification, and specifically includes:
  • S12 determine whether the input data is an image: if so, select a multi-view image from the input data; if not, convert the input data into a multi-view image;
  • the data collection of the original multi-view image can be completed by capturing images with any camera at various viewing angles or directly capturing a video while the camera is moving.
  • the input data in this embodiment may be images, videos, or a combination of images and video. If the input is images, it is only necessary to extract multi-view images from the input data, select multi-view image pairs with similar viewing angles and common regions from the multi-view images, and finally enhance the image quality through basic image preprocessing techniques such as image filtering. If the input is a video, the video must first be converted into multi-view images, after which multi-view image pairs with similar viewing angles and common regions are selected and image preprocessing is performed.
  • the step S13 of selecting a pair of multi-view images specifically includes: performing feature matching on the multi-view images by using two-dimensional scale-invariant image features, and obtaining matching information of pixels and matching degrees of image features;
  • the camera extrinsic parameter matrix is obtained according to the matching information, the degree of overlapping of viewing angles between the images is calculated according to the degree of matching, and the overlapping degrees of viewing angles are sorted to obtain other viewing angles that are close to each viewing angle.
  • After acquiring the multi-view images, feature matching is performed between all multi-view image pairs using two-dimensional scale-invariant image features such as SIFT, ORB, and SURF. Relying on the two-dimensional pixel matching information, the bundle adjustment problem over all cameras is solved, and the relative pose relationships between different cameras, i.e. the camera extrinsic matrices, are calculated.
  • the degree of view overlap between all image pairs is calculated according to the matching degree of the image features. The views are sorted by overlap, and for each view the 10 closest of the remaining views are obtained. The multi-view images of N viewpoints can thereby be divided into N groups of multi-view image pairs, which are used for the subsequent stereo matching process.
  • the multi-view image pair generally includes 3-7 multi-view images, and selecting multi-view image pairs with similar viewing angles and overlapping regions can facilitate subsequent feature matching. It should be noted that if the viewing angle difference is too large and the overlapping area is too small, the effective area will be very small when the subsequent process finds matching points, which will affect the progress of the process.
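  • The view-pair selection described above can be sketched as follows. This is a minimal illustration, assuming OpenCV SIFT features and a simple count of ratio-test matches as the view-overlap score; the bundle adjustment step and the patent's exact scoring are not reproduced here, and all function and parameter names are illustrative.

```python
import cv2
import numpy as np

def select_view_pairs(images, num_neighbors=10):
    """For every view, rank the other views by 2D feature-match overlap.

    images: list of HxWx3 uint8 BGR arrays, one per viewpoint.
    Returns a dict mapping each view index to its most similar views.
    """
    sift = cv2.SIFT_create()
    feats = [sift.detectAndCompute(cv2.cvtColor(im, cv2.COLOR_BGR2GRAY), None)
             for im in images]                       # (keypoints, descriptors) per view

    n = len(images)
    overlap = np.zeros((n, n))
    matcher = cv2.BFMatcher()
    for i in range(n):
        for j in range(i + 1, n):
            if feats[i][1] is None or feats[j][1] is None:
                continue
            knn = matcher.knnMatch(feats[i][1], feats[j][1], k=2)
            # Lowe's ratio test keeps only distinctive matches.
            good = [p[0] for p in knn
                    if len(p) == 2 and p[0].distance < 0.7 * p[1].distance]
            overlap[i, j] = overlap[j, i] = len(good)

    pairs = {}
    for i in range(n):
        order = [j for j in np.argsort(-overlap[i]) if j != i]
        pairs[i] = order[:num_neighbors]             # the 10 closest views by default
    return pairs
```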
  • S2 obtains the photometric consistency loss by performing depth estimation processing on the multi-view image pair, and the specific process is shown in FIG. 3 of the specification, including:
  • Depth estimation processing is a commonly used technical means in existing 3D reconstruction methods.
  • the specific process includes: inputting the multi-view image pair and the reference view into the depth estimation network for depth estimation to obtain a depth map; performing homography mapping between the depth map and the multi-view image pair, and reconstructing the images of the non-reference views onto the reference view to obtain reprojected view images; the regression loss, i.e. the L2 loss, is obtained by calculating the difference between the reprojected view and the reference view, and the photometric consistency error is obtained from the L2 loss.
  • the specific principle is shown in Figure 4 of the description.
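  • A minimal PyTorch sketch of this reprojection-based photometric self-supervision is given below. It assumes pinhole intrinsics K, a relative pose (R, t) from the reference view to one source view, and a depth map predicted for the reference view; the depth estimation network itself and any additional weighting are omitted, and the helper names are illustrative.

```python
import torch
import torch.nn.functional as F

def warp_src_to_ref(src_img, ref_depth, K, R, t):
    """Reproject a source-view image into the reference view using the predicted depth.
    src_img: (B,3,H,W), ref_depth: (B,1,H,W), K: (B,3,3), R/t: reference-to-source pose."""
    B, _, H, W = src_img.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()   # (3,H,W) homogeneous pixels
    pix = pix.view(1, 3, -1).expand(B, -1, -1).to(src_img.device)     # (B,3,H*W)

    cam = torch.linalg.inv(K) @ pix * ref_depth.view(B, 1, -1)        # back-project with depth
    cam_src = R @ cam + t.view(B, 3, 1)                               # move into the source frame
    pix_src = K @ cam_src
    pix_src = pix_src[:, :2] / pix_src[:, 2:].clamp(min=1e-6)         # perspective division

    # Normalize to [-1, 1] for grid_sample; bilinear sampling keeps the warp differentiable.
    gx = 2.0 * pix_src[:, 0] / (W - 1) - 1.0
    gy = 2.0 * pix_src[:, 1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).view(B, H, W, 2)
    warped = F.grid_sample(src_img, grid, align_corners=True)
    valid = ((gx.abs() <= 1) & (gy.abs() <= 1)).float().view(B, 1, H, W)
    return warped, valid

def photometric_consistency_loss(ref_img, src_img, ref_depth, K, R, t):
    warped, valid = warp_src_to_ref(src_img, ref_depth, K, R, t)
    # L2 (regression) loss between the reconstructed and the original reference image,
    # restricted to pixels that actually project inside the source view.
    return ((warped - ref_img) ** 2 * valid).sum() / valid.sum().clamp(min=1.0)
```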
  • S3 obtains the semantic consistency loss by performing collaborative segmentation processing on the multi-view image pairs and obtains the data enhancement consistency loss by performing data enhancement processing on the multi-view image pairs; the collaborative segmentation processing and the data enhancement processing run in parallel.
  • Step S3 is the core step of this embodiment. The two branches of the collaborative segmentation processing and the data enhancement processing are run in parallel to obtain the semantic consistency loss and the data enhancement consistency loss.
  • S312. Obtain a reference view and non-reference views, reconstruct the co-segmented images of the non-reference views onto the reference view through homography to obtain reprojected co-segmented images, and calculate the cross-entropy loss between each reprojected co-segmented image and the co-segmented image of the reference view;
  • the collaborative segmentation processing flow includes: inputting the reference view and the multi-view image pair into the pre-trained VGG network and then performing non-negative matrix factorization to obtain the co-segmented image of the reference view and the co-segmented images of the non-reference views; the co-segmented images of the non-reference views are warped by homography to obtain the reprojected co-segmented images, the cross-entropy loss between the reprojected co-segmented images and the co-segmented image of the reference view is calculated, and the semantic consistency error is then obtained.
  • the specific process is shown in Figure 6 of the description.
  • the cooperative segmentation process is similar to the depth estimation process of step S2.
  • First, the reference view and multi-view image pairs are input into a pretrained convolutional neural network.
  • Each image in the multi-view image pair is fed into a shared-weight convolutional neural network to extract features; preferably, the convolutional neural network is a VGG network pre-trained on ImageNet.
  • The image of each view yields a corresponding feature map tensor of dimension H×W×C, where H and W are the height and width of the image and C is the number of channels of the convolutional layer in the convolutional neural network.
  • the feature map tensors of all views are flattened and concatenated to form a two-dimensional matrix, namely the feature map matrix A ∈ R^((V·H·W)×C), where V is the total number of views.
  • the feature map matrix is subjected to non-negative matrix decomposition through chain iteration to obtain a first non-negative matrix P and a second non-negative matrix Q.
  • the first non-negative matrix P and the second non-negative matrix Q satisfy A ≈ PQ, with P ∈ R^((V·H·W)×K) and Q ∈ R^(K×C):
  • the P matrix represents the correlation degree of each pixel of all multi-view images with respect to the semantic cluster center (the vector of each row of the Q matrix), that is, the segmentation confidence.
  • the collaborative segmentation of multi-view images is realized without relying on any supervision signal, and the common semantic information of multi-view images is extracted.
  • the schematic diagram of the non-negative matrix decomposition to achieve collaborative segmentation and extraction of common semantic information is shown in the accompanying drawings of the description.
  • V is the total number of viewing angles
  • H and W are the height and width of the image
  • K is the number of columns of P and the number of rows of Q
  • R denotes the set of real numbers.
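  • The feature extraction and factorization steps can be sketched as follows, assuming a recent torchvision VGG-16 backbone and scikit-learn's NMF as the solver; the resulting segmentation maps have the resolution of the feature maps rather than the full image, and the layer choice and all names are illustrative.

```python
import torch
import torchvision
from sklearn.decomposition import NMF

def co_segment(views, K=8):
    """views: (V,3,H,W) float tensor of one multi-view pair (ImageNet-normalized).
    Returns co-segmentation label maps of shape (V, h, w) with K shared clusters."""
    # Features after a ReLU layer of VGG-16 are non-negative, as NMF requires.
    vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features[:16].eval()
    with torch.no_grad():
        feat = vgg(views)                                   # (V, C, h, w)
    V, C, h, w = feat.shape
    A = feat.permute(0, 2, 3, 1).reshape(-1, C).numpy()     # feature map matrix A, (V*h*w) x C

    # A ~= P @ Q: rows of Q are the semantic cluster centres shared by all views,
    # rows of P are the per-pixel confidences (segmentation confidences).
    nmf = NMF(n_components=K, init="random", random_state=0, max_iter=300)
    P = nmf.fit_transform(A)                                # (V*h*w, K)
    seg = P.argmax(axis=1).reshape(V, h, w)                 # co-segmentation labels per view
    return seg
```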
  • When non-negative matrix factorization is applied to multiple views of real scenes, the solution often fails due to flaws in the method itself. This is largely because the iterative solution process depends heavily on the random initialization: if a good initial value is not drawn, the non-negative matrix factorization does not converge, the collaborative segmentation fails as well, and ultimately the entire training process cannot proceed.
  • Therefore, the original iterative solution process is extended into a multi-branch parallel solution process: multiple sets of solutions are randomly initialized each time, and the optimal solution is selected and passed to the next iteration. This largely avoids solution failure caused by bad random initialization values.
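  • A minimal NumPy sketch of this multi-branch strategy is shown below. It simplifies the per-iteration selection described above into running several independently initialized multiplicative-update branches and keeping the factorization with the smallest reconstruction error; this simplification and all names are assumptions, not the patent's exact scheme.

```python
import numpy as np

def nmf_multi_restart(A, K, n_branches=4, n_iters=200, eps=1e-9, seed=0):
    """Factorize a non-negative matrix A (m x n) as A ~= P @ Q with several random restarts."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    best = None
    for _ in range(n_branches):
        P = rng.random((m, K)) + eps
        Q = rng.random((K, n)) + eps
        for _ in range(n_iters):
            # Standard multiplicative updates for the Frobenius-norm NMF objective.
            Q *= (P.T @ A) / (P.T @ P @ Q + eps)
            P *= (A @ Q.T) / (P @ Q @ Q.T + eps)
        err = np.linalg.norm(A - P @ Q)
        if best is None or err < best[0]:
            best = (err, P, Q)       # keep the branch with the lowest reconstruction error
    return best[1], best[2]
```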
  • this embodiment only needs to mine common semantic components (clusters) in different views, and no longer needs to care about specific scenes and semantic labels. Therefore, the method provided in this embodiment can be generalized to any dynamically changing scene without requiring a lot of tedious and expensive semantic annotation work like other methods.
  • S312 specifically includes: dividing the V views into view pairs consisting of a reference view and a series of non-reference views, with the co-segmented image S_1 under the reference view and the co-segmented images S_i under the non-reference views.
  • By default, the view with index 1 is the reference view, and the view with index i is defined as a non-reference view, where 2 ≤ i ≤ V.
  • Using the homography formula, the correspondence between the pixel at position p_j in the reference view and the pixel at position p̂_(i,j) in the source (non-reference) view can be calculated from the predicted depth and the camera parameters, where p_j is the position of the pixel under the reference view, p̂_(i,j) is the position of the corresponding pixel in the non-reference view, j is the index value of the pixel in the image or segmentation map with 1 ≤ j ≤ H·W, and D is the depth map predicted by the network.
  • With this correspondence, the co-segmented image S_i of the non-reference view can be projected to the reference view to obtain the reprojected co-segmented image Ŝ_(i→1).
  • M_i represents the effective area mapped by homography from the non-reference view to the reference view.
  • the cross-entropy loss between the reconstructed semantic segmentation map and the original semantic segmentation map is calculated for all view pairs. If the predicted depth map is correct, the semantic segmentation map reconstructed from it should also be as similar as possible to the original semantic segmentation map.
  • the semantic consistency loss is calculated by accumulating, over all non-reference views i and over all pixels j inside the valid area M_i, the per-pixel cross-entropy f(S_(1,j)) between the reprojected co-segmented image Ŝ_(i→1) and the reference-view co-segmented image S_1, where:
  • f(S_(1,j)) is the per-pixel cross-entropy loss
  • L_SC is the semantic consistency error
  • M_i represents the effective area mapped by homography from the non-reference view to the reference view
  • N is the set of natural numbers
  • j represents the index value of the pixel in the image
  • H and W are the height and width of the image
  • S_1 is the co-segmented image under the reference view
  • i is a non-reference view.
  • the weight of semantic consistency loss during training is set to 0.1 by default.
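  • Assuming the co-segmentation confidences of a non-reference view have already been warped into the reference view with the same homography warping used by the depth branch, the semantic consistency term can be sketched as below; the per-pixel cross-entropy form and the default weight 0.1 follow the description, while the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def semantic_consistency_loss(warped_seg_logits, ref_seg, valid_mask, weight=0.1):
    """Cross-entropy between the reprojected co-segmentation and the reference one.

    warped_seg_logits: (B,K,H,W) co-segmentation confidences of a non-reference view
                       warped into the reference view.
    ref_seg:           (B,H,W) long tensor of cluster labels for the reference view.
    valid_mask:        (B,H,W) binary mask of the area covered by the homography mapping.
    """
    ce = F.cross_entropy(warped_seg_logits, ref_seg, reduction="none")   # per-pixel loss
    loss = (ce * valid_mask).sum() / valid_mask.sum().clamp(min=1.0)
    return weight * loss
```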
  • The data augmentation operation itself changes the pixel values of the multi-view images, so directly applying a data augmentation strategy may break the brightness consistency assumption of the self-supervised signal.
  • self-supervised signals come from the data itself and are more susceptible to noise interference from the data itself.
  • The original self-supervised training branch is expanded into a dual-stream structure: one standard branch is supervised only by the photometric stereo self-supervised signal, while the other branch introduces various random data augmentation changes.
  • the data enhancement consistency loss extends the self-supervised branch into a dual-stream structure, uses the prediction results of the standard branch as pseudo-labels to supervise the prediction results of the data enhancement branch, and decouples the data augmentation consistency from the brightness consistency assumption so that the two are handled separately.
  • existing self-supervised signals based on photometric stereo consistency are often limited by the luminance consistency assumption and do not allow data augmentation operations. Because data augmentation changes the pixel distribution of the image, the luminance consistency assumption is broken, which in turn leads to luminance consistency ambiguity, making the self-supervised signal less reliable.
  • the specific process of data enhancement processing is shown in Figure 7 of the description, including:
  • the specific process of data enhancement processing includes: inputting the reference view and multi-view image pairs into the depth estimation network for depth estimation to obtain a depth map, obtaining a valid-area mask according to the depth map, and using the depth map together with the valid-area mask as the pseudo-label; after random data augmentation is applied to the reference view and multi-view image pairs, they are input into the depth estimation network again to obtain the contrast depth map, the difference between the contrast depth map and the pseudo-label is calculated, and the data enhancement consistency loss is then obtained.
  • the principle of data enhancement processing is shown in Figure 8 of the description.
  • data augmentation strategies include random occlusion masks, gamma correction, color perturbation, and random noise.
  • Let the original multi-view images be I, let the data augmentation function acting on the multi-view image pair be τ_θ, and let the augmented multi-view images be τ_θ(I), where θ denotes the parameters of the specific operations in the data augmentation process.
  • The data augmentations used are: random occlusion masks, gamma correction, color perturbation, and random noise.
  • A binary occlusion mask can be randomly generated to occlude part of the area under the reference view, while its complement indicates the remaining regions that stay valid for prediction. The predictions in the non-occluded area should remain invariant to such occlusion changes, so the entire system should remain invariant across these artificial occlusion edges, which guides the model to pay more attention to the handling of occlusion edges.
  • Gamma correction is a common data augmentation operation used to adjust image lighting. To simulate as many and complex lighting variations as possible, random gamma correction is introduced for data augmentation.
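  • The four augmentations can be composed as in the following sketch; the parameter ranges and probabilities are illustrative assumptions rather than values taken from the patent.

```python
import torch

def augment(imgs, mask_ratio=0.3, p_occlude=0.5):
    """Random occlusion, gamma correction, color perturbation and random noise.

    imgs: (V,3,H,W) multi-view images with values in [0,1].
    Returns the augmented images and the binary non-occluded valid-area mask.
    """
    V, _, H, W = imgs.shape
    out = imgs.clone()
    valid = torch.ones(V, 1, H, W, device=imgs.device)

    # 1) Random occlusion mask: black out a rectangle and record the remaining valid area.
    if torch.rand(1).item() < p_occlude:
        h, w = int(H * mask_ratio), int(W * mask_ratio)
        y0 = torch.randint(0, H - h, (1,)).item()
        x0 = torch.randint(0, W - w, (1,)).item()
        out[..., y0:y0 + h, x0:x0 + w] = 0.0
        valid[..., y0:y0 + h, x0:x0 + w] = 0.0

    # 2) Gamma correction with a random exponent to simulate lighting changes.
    gamma = torch.empty(1).uniform_(0.5, 2.0).item()
    out = out.clamp(min=1e-6) ** gamma

    # 3) Color perturbation: random per-channel gain.
    gain = torch.empty(1, 3, 1, 1, device=imgs.device).uniform_(0.8, 1.2)
    out = out * gain

    # 4) Additive random noise.
    out = out + 0.02 * torch.randn_like(out)
    return out.clamp(0.0, 1.0), valid
```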
  • S322 uses the depth image as a pseudo-label to supervise the multi-view image pair after data enhancement processing, and obtains the data loss under different data enhancement strategies.
  • the data augmentation strategy needs to ensure that there is a relatively reliable reference standard.
  • In supervised settings this reference standard is usually the invariance of the ground-truth label under random data augmentation, but in self-supervised training this assumption cannot be used because no ground-truth label is available. Therefore, in this embodiment the depth map predicted by the standard self-supervised training branch, i.e. the depth map from the depth estimation process of step S2, is used as a pseudo ground-truth label whose invariance under augmentation is enforced. This operation decouples data augmentation from the self-supervised loss without affecting the brightness consistency assumption of the self-supervised loss.
  • The several data augmentation strategies in step S321 can be combined to obtain a comprehensive data augmentation function.
  • the depth map predicted by the standard self-supervised branch (the depth estimation process) is D, and the depth map predicted by the data augmentation branch is D̂. The data augmentation consistency loss L_DA is calculated as the element-wise difference between D̂ and the pseudo-label D, taken within the binary non-occluded valid-area mask induced by the random occlusion mask, where the composed augmentation consists of the random occlusion mask, the gamma correction, the color perturbation and the random noise, and D is the depth map predicted by the network.
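  • A minimal sketch of this consistency term follows; it uses an L1 difference inside the non-occluded valid area as an assumed distance, since the exact norm of the patent's formula is not reproduced here.

```python
import torch

def data_augmentation_consistency_loss(depth_aug, depth_pseudo, valid_mask):
    """depth_aug:    (B,1,H,W) depth predicted from the augmented views.
    depth_pseudo: (B,1,H,W) depth predicted by the standard branch (pseudo-label).
    valid_mask:   (B,1,H,W) binary non-occluded valid-area mask from the occlusion step."""
    diff = (depth_aug - depth_pseudo.detach()).abs() * valid_mask
    return diff.sum() / valid_mask.sum().clamp(min=1.0)
```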
  • The influence weight of the data augmentation loss is adaptively adjusted according to the training progress: the weight is 0.01 at the beginning and then doubles every two epochs, so the data augmentation loss only plays a substantial role after the network has converged.
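  • The adaptive weight just described (0.01 at the start, doubled every two epochs) can be written, for example, as follows; the optional cap is an illustrative assumption.

```python
def data_aug_loss_weight(epoch, base=0.01, max_weight=1.0):
    # 0.01 initially, doubled every two epochs, optionally capped (the cap is an assumption).
    return min(base * (2 ** (epoch // 2)), max_weight)
```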
  • The entire self-supervised training framework involves many operations; in particular, the data augmentation branch requires running the whole network forward a second time.
  • If a parallel forward-backward update strategy is used directly, the GPU memory during training (11 GB by default) is insufficient and overflows.
  • Therefore, a strategy of trading time for space is adopted to address the memory overflow problem.
  • The original single pass of forward computation, self-supervised loss evaluation, and backpropagation is divided into two separate forward and backpropagation passes.
  • First, the self-supervised loss of the standard branch is computed in a forward pass, the gradients are updated by backpropagation, the cache is cleared, and the depth map predicted by the standard branch is saved as a pseudo-label; then the self-supervised loss of the data augmentation branch is computed in a second forward pass and used for its training. Since the gradient updates of the multiple losses are decoupled into different stages, they do not need to occupy GPU memory at the same time, which greatly reduces the GPU memory footprint.
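  • A hedged PyTorch sketch of this two-stage forward/backward strategy is given below; `model` and the loss and augmentation callables are placeholders for the components sketched earlier, not the patent's implementation.

```python
import torch

def train_step(model, batch, optimizer, std_loss_fn, aug_fn, da_loss_fn, da_weight):
    """Two-stage forward/backward pass that trades time for GPU memory."""
    optimizer.zero_grad()

    # Stage 1: standard branch (photometric and semantic self-supervision).
    depth = model(batch["imgs"], batch["cams"])
    loss_std = std_loss_fn(depth, batch)
    loss_std.backward()                      # frees the stage-1 computation graph
    pseudo = depth.detach()                  # standard-branch depth kept as the pseudo-label
    torch.cuda.empty_cache()                 # drop cached activations before stage 2

    # Stage 2: data augmentation branch supervised by the pseudo-label.
    aug_imgs, valid = aug_fn(batch["imgs"])
    depth_aug = model(aug_imgs, batch["cams"])
    loss_da = da_weight * da_loss_fn(depth_aug, pseudo, valid)
    loss_da.backward()

    optimizer.step()                         # gradients of both stages have been accumulated
    return loss_std.item(), loss_da.item()
```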
  • S4 constructs a loss function according to the loss of photometric consistency, the loss of semantic consistency, and the loss of data enhancement consistency.
  • the loss function L combines the three terms as L = L_PC + L_SC + L_DA (each term carrying the weight described above), where:
  • L_PC is the photometric consistency loss
  • L_DA is the data enhancement consistency loss
  • L_SC is the semantic consistency loss
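  • Putting the three terms together, a sketch of the combined objective reads as follows; the additive form with the 0.1 semantic weight and the adaptive data augmentation weight is inferred from the description above, not quoted from the patent.

```python
def total_loss(l_pc, l_sc, l_da, epoch, w_sc=0.1):
    # L = L_PC + 0.1 * L_SC + w_DA(epoch) * L_DA, using the schedule sketched above.
    return l_pc + w_sc * l_sc + data_aug_loss_weight(epoch) * l_da
```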
  • Traditional stereo matching is replaced by dense depth map estimation based on deep learning: a neural network model is constructed and trained according to the loss function, and the neural network model is applied to the complete three-dimensional reconstruction to obtain a three-dimensional model. The result is comparable to methods trained on manually annotated samples.
  • This embodiment provides an alternative solution for training a high-precision 3D reconstruction model at a low cost, which can be extended to scenarios related to 3D reconstruction, such as map exploration, automatic driving, and AR/VR.
  • the method proposed in this embodiment is evaluated on the DTU dataset, and the experimental results are shown in FIG. 9 of the description.
  • the DACS-MS proposed in the embodiment of the present invention has an average per-point reconstruction error of 0.358 mm on the DTU dataset, which is much smaller than that of similar unsupervised methods such as MVS, MVS², and M³VSNet.
  • DACS-MS is also close to the state-of-the-art supervised methods and surpasses some existing supervised methods.
  • Experimental results show that the self-supervised method proposed in this embodiment outperforms traditional unsupervised 3D reconstruction methods on the DTU dataset and can achieve results comparable to state-of-the-art supervised methods.
  • The effect of models reconstructed using the self-supervised three-dimensional reconstruction method based on collaborative segmentation and data enhancement provided in this embodiment is shown in FIG. 10 and FIG. 11 of the description.
  • the experimental results of this embodiment are shown in the third column of the accompanying drawings. From the specific experimental results, this embodiment can fully achieve the same or similar technical effects as the supervised method, and the reconstructed 3D model meets the technical requirements.
  • This embodiment provides a self-supervised 3D reconstruction method based on collaborative segmentation and data enhancement.
  • Aiming at the brightness consistency ambiguity problem, abstract semantic cues are introduced and a data augmentation mechanism is embedded in the self-supervised signal, which enhances the reliability of the self-supervised signal under noise disturbance.
  • the self-supervised training method proposed in this embodiment surpasses traditional unsupervised methods and can achieve comparable results with some leading supervised methods.
  • the shared semantic information components are dynamically mined from multi-view pairs through clustering.
  • the data enhancement consistency loss extends the self-supervised branch into a dual-stream structure, uses the prediction results of the standard branch as pseudo-labels to supervise the prediction results of the data enhancement branch, and decouples the data augmentation consistency from the brightness consistency assumption so that the two are processed separately.
  • this embodiment modularizes the self-supervised 3D reconstruction method based on collaborative segmentation and data enhancement proposed in Embodiment 1 to form a self-supervised 3D reconstruction system based on collaborative segmentation and data enhancement.
  • a schematic diagram of each module is shown in FIG. 12 in the specification, and a complete system structure diagram is shown in FIG. 13 in the specification.
  • a self-supervised three-dimensional reconstruction system based on collaborative segmentation and data enhancement includes an input unit 1, a depth processing unit 2, a dual-branch processing unit 3, a loss function construction unit 4 and an output unit 5 that are connected in sequence.
  • the input unit 1 is used for acquiring input data, and acquiring multi-view image pairs having overlapping regions and similar viewing angles according to the input data.
  • the input unit includes an input data acquisition unit 11 , a conversion unit 12 , a screening unit 13 and a preprocessing unit 14 .
  • the depth processing unit 2 is configured to obtain the photometric consistency loss by performing depth estimation processing on the multi-view image pair.
  • the depth processing unit includes a depth image acquisition unit 21, a regression loss acquisition unit 22 and a photometric loss acquisition unit 23.
  • the dual-branch processing unit 3 includes a collaborative segmentation unit 31 and a data enhancement unit 32.
  • the collaborative segmentation unit 31 and the data enhancement unit 32 run in parallel.
  • the collaborative segmentation unit 31 is used to obtain the semantic consistency loss by performing collaborative segmentation processing on multi-view image pairs.
  • the data enhancement unit 32 is configured to obtain the data enhancement consistency loss by performing data enhancement processing on the multi-view image pair.
  • the loss function construction unit 4 is used for constructing a loss function according to the photometric consistency loss, the semantic consistency loss and the data enhancement consistency loss.
  • the output unit 5 is used for constructing and training a neural network model according to the loss function, and obtaining a three-dimensional model of the input data based on the neural network model.
  • the depth processing unit 2 includes a depth image acquisition unit 21, a regression loss acquisition unit 22 and a photometric loss acquisition unit 23.
  • the basic principles include: inputting multi-view image pairs and reference views into the depth estimation network for depth estimation, obtaining a depth map, performing homography mapping on the depth map and multi-view image pairs, and reconstructing the depth images from non-reference views.
  • the reprojected view image is obtained, the regression loss can be obtained by calculating the difference between the reprojected view and the reference view, and the photometric consistency error is obtained based on the regression loss.
  • the specific structure includes:
  • the depth image acquisition unit 21 is configured to perform depth estimation on a multi-view image through a depth estimation network to acquire a depth image.
  • the regression loss obtaining unit 22 is configured to obtain a reference view angle and a non-reference view angle, reconstruct the depth image on the non-reference view angle through homography to obtain a reprojected view image, and calculate the regression loss according to the reprojection view image.
  • the photometric loss obtaining unit 23 is configured to obtain the photometric consistency loss according to the regression loss.
  • the collaborative segmentation unit 31 includes a segmented image acquisition unit 311 , a cross-entropy loss acquisition unit 312 and a semantic loss acquisition unit 313 .
  • the basic principle of the collaborative segmentation unit 31 includes: inputting a reference view and a multi-view image pair into a pre-trained VGG network, and then performing non-negative matrix decomposition to obtain the collaborative segmentation image under the reference view and the collaborative segmentation image under the non-reference view, Perform homography projection on the co-segmented image in the non-reference view to obtain the re-projected co-segmented image, calculate the cross-entropy loss between the re-projected co-segmented image and the co-segmented image under the reference view, and then obtain the semantic consistency error.
  • the specific structure includes:
  • the segmented image obtaining unit 311 is configured to perform cooperative segmentation on the multi-view image pair by using a non-negative matrix to obtain a cooperatively segmented image.
  • the cross-entropy loss obtaining unit 312 is configured to obtain a reference view and a non-reference view, reconstruct the co-segmented image on the non-reference view through homography to obtain a re-projected co-segmented image, and calculate the re-projected co-segmented image and the reference view Cross-entropy loss between co-segmented images on .
  • the semantic loss obtaining unit 313 is configured to obtain the semantic consistency loss according to the cross entropy loss.
  • the data enhancement unit 32 includes a data processing unit 321 , a data loss obtaining unit 322 and a data consistency loss obtaining unit 323 .
  • the basic principle of the data enhancement unit 32 includes: inputting the reference view and the multi-view image pair to the depth processing unit to perform depth estimation processing, obtaining a depth map, obtaining an effective area mask according to the depth map, and using the effective area mask as a pseudo-label. After random data enhancement is performed on the reference view and multi-view image pair, the depth estimation process is performed with the depth estimation network to obtain the contrast depth map, and the difference between the contrast depth map and the pseudo-label is calculated, and then the consistency loss of data enhancement is obtained.
  • the specific structure includes:
  • the data processing unit 321 is configured to perform data enhancement processing on the multi-view image pair by adopting different data enhancement strategies.
  • the data processing unit is provided with a depth estimation network.
  • the data loss obtaining unit 322 is configured to supervise the multi-view image pair after data enhancement processing with the depth image as a pseudo-label, and obtain the data loss under different data enhancement strategies.
  • the data consistency loss obtaining unit 323 is configured to obtain the data enhancement consistency loss according to the data loss.
  • the input unit 1 includes an input data acquisition unit 11 , a conversion unit 12 , a screening unit 13 and a preprocessing unit 14 .
  • the specific structure includes:
  • the input data acquisition unit 11 is used for acquiring input data, and the input data includes images or videos.
  • the conversion unit 12 is configured to determine whether the input data is an image: if so, select a multi-view image from the input data. If not, convert the input data to a multi-view image.
  • the screening unit 13 is configured to acquire, according to the multi-view images, pairs of multi-view images with similar viewing angles and having the same area.
  • the preprocessing unit 14 is configured to perform image preprocessing on the multi-view image pair.
  • In conclusion, this embodiment modularizes the method of Embodiment 1 to form a specific self-supervised 3D reconstruction system based on collaborative segmentation and data enhancement, making it more practical.
  • the present invention provides a self-supervised three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement.
  • Aiming at the brightness consistency ambiguity problem, abstract semantic cues are introduced and a data augmentation mechanism is embedded in the self-supervised signal, which enhances the reliability of the self-supervised signal under noise disturbance.
  • the self-supervised training method proposed in the present invention surpasses traditional unsupervised methods and can achieve comparable results with some leading supervised methods.
  • Based on the semantic consistency loss of collaborative segmentation, the shared semantic information components are dynamically mined from multi-view pairs through clustering.
  • the data enhancement consistency loss extends the self-supervised branch into a dual-stream structure, uses the prediction results of the standard branch as pseudo-labels to supervise the prediction results of the data enhancement branch, and decouples the data augmentation consistency from the brightness consistency assumption so that the two are processed separately, which makes it possible to introduce a large amount of data augmentation into the self-supervised signal to enrich the variation of the training set. The whole process requires no label data and does not rely on ground-truth annotation; instead, effective information is mined from the data itself to drive network training, which greatly saves cost and shortens the reconstruction process. By integrating depth prediction, collaborative segmentation and data enhancement, and on the basis of solving the GPU memory overflow problem, the accuracy of the self-supervised signal is improved, so that this embodiment generalizes better.
  • the method is modularized to form a specific system, making it more practical.
  • modules or steps of the present invention can be implemented by a general-purpose computing device, and they can be centralized on a single computing device, or distributed on a network composed of multiple computing devices.
  • Optionally, they may be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device, or they may be fabricated separately as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module.
  • the present invention is not limited to any specific combination of hardware and software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed are a self-supervised three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement. Said method comprises: acquiring input data, and acquiring a multi-view image pair according to the input data; acquiring a photometric consistency loss by performing depth estimation processing on the multi-view image pair; acquiring a semantic consistency loss by performing collaborative segmentation processing on the multi-view image pair; acquiring a data enhancement consistency loss by performing data enhancement processing on the multi-view image pair; constructing a loss function according to the photometric consistency loss, the semantic consistency loss and the data enhancement consistency loss; and constructing and training a neural network model according to the loss function, and acquiring a three-dimensional model corresponding to the input data on the basis of the neural network model. By introducing a semantic cue and embedding a data enhancement mechanism, the reliability of a self-supervised signal under noise disturbance is enhanced, the precision and performance of the self-supervised algorithm are improved, the cost is low, the generalization is high, and the application scene is wide.

Description

Self-supervised 3D reconstruction method and system based on collaborative segmentation and data augmentation
Technical Field
The present invention relates to the field of image processing, and in particular to a self-supervised three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement.
Background Art
The 3D reconstruction method based on multi-view stereo (MVS) aims to recover the 3D structure of a scene from natural images captured from multiple given viewpoints together with the camera positions. Although traditional 3D reconstruction methods can effectively reconstruct 3D models in general scenes, due to the limitations of traditional measurement methods they can often only reconstruct a relatively sparse point cloud, losing a considerable amount of detail. In addition, they are easily disturbed by factors such as noise and lighting.
With the rapid development of deep learning, more and more researchers have begun to apply it to the field of 3D reconstruction. With the help of the powerful feature extraction capability of deep convolutional neural networks (CNNs), these learning-based methods project the feature maps extracted by the CNN onto the same reference view through homography mapping and construct a matching cost volume (CV) between these views at several depth hypotheses. The cost volume is used to predict the depth map at the reference view, and the depth maps of all views are fused together to reconstruct the 3D information of the entire scene. Such data-driven 3D reconstruction methods, such as MVSNet, R-MVSNet, and Point-MVSNet, have achieved better results than traditional 3D reconstruction methods.
However, these methods depend heavily on the availability of large-scale 3D datasets; without sufficient labeled samples it is difficult to achieve good results. In addition, for 3D reconstruction, obtaining accurate ground-truth sample labels is difficult and expensive. As a result, a series of un/self-supervised 3D reconstruction methods has been derived, aiming to train deep 3D reconstruction networks with artificially designed self-supervised signals instead of a large number of expensive ground-truth labels.
These self-supervised methods recast the depth estimation problem in the 3D reconstruction pipeline as an image reconstruction problem in order to design self-supervised signals. The depth map predicted by the network and the multi-view images are projected to the same view through homography mapping, and pixel values are computed by bilinear interpolation to keep the reconstructed image differentiable. The self-supervised loss then measures the difference between the reconstructed image and the original image, and the network is trained until convergence. Unsup_MVS ranks and filters out unreliable self-supervised signals based on the correlation of matched features between views; MVS² adds a model that adaptively judges occlusion relationships on top of the original image-reprojection self-supervised signal; M³VSNet introduces surface normal information to assist self-supervised training and achieves a certain performance improvement. Although current un/self-supervised 3D reconstruction techniques have made considerable progress, a gap to supervised 3D reconstruction methods remains.
To sum up, although existing un/self-supervised 3D reconstruction methods can achieve certain results, there is still a large gap compared with supervised 3D reconstruction methods under the same conditions, which makes unsupervised 3D reconstruction methods less reliable.
Therefore, an un/self-supervised 3D reconstruction method that can solve the above problems is needed.
发明内容SUMMARY OF THE INVENTION
基于现有技术存在的问题,本发明提供了基于协同分割与数据增强的自监督三维重建方法及系统。具体方案如下:Based on the problems existing in the prior art, the present invention provides a self-supervised three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement. The specific plans are as follows:
一种基于协同分割与数据增强的自监督三维重建方法,包括:A self-supervised 3D reconstruction method based on collaborative segmentation and data augmentation, comprising:
图像对获取:获取输入数据,根据所述输入数据获取具有重合区域且视角相似的多视角图像对;Image pair acquisition: acquire input data, and acquire multi-view image pairs with overlapping areas and similar viewing angles according to the input data;
深度估计处理:通过对所述多视角图像对进行深度估计处理,获取光度一致性损失;Depth estimation processing: obtain the loss of luminosity consistency by performing depth estimation processing on the multi-view image pair;
协同分割处理:通过对所述多视角图像对进行协同分割处理,获取语义一致性损失;Collaborative segmentation processing: obtain semantic consistency loss by performing collaborative segmentation processing on the multi-view image pair;
数据增强处理:通过对所述多视角图像对进行数据增强处理,获取数 据增强一致性损失;Data enhancement processing: obtain data enhancement consistency loss by performing data enhancement processing on the multi-view image pair;
构建损失函数:根据所述光度一致性损失、所述语义一致性损失和所述数据增强一致性损失构建损失函数;Constructing a loss function: constructing a loss function according to the photometric consistency loss, the semantic consistency loss, and the data augmentation consistency loss;
模型输出:根据所述损失函数构建并训练神经网络模型,基于所述神经网络模型获取与所述输入数据对应的三维模型。Model output: construct and train a neural network model according to the loss function, and obtain a three-dimensional model corresponding to the input data based on the neural network model.
在一个具体的实施例中,所述协同分割处理具体包括:In a specific embodiment, the cooperative segmentation process specifically includes:
协同分割图像获取:通过非负矩阵对所述多视角图像对进行协同分割,获取协同分割图像;Obtaining a collaboratively segmented image: performing collaborative segmentation on the multi-view image pair through a non-negative matrix to obtain a collaboratively segmented image;
交叉熵损失获取:获取参考视角和非参考视角,将所述非参考视角上的协同分割图像进行重建得到重投影协同分割图像,并计算所述重投影协同分割图像与所述参考视角上的协同分割图像之间的交叉熵损失;Cross-entropy loss acquisition: obtain a reference view and a non-reference view, reconstruct the co-segmented image on the non-reference view to obtain a re-projected co-segmented image, and calculate the collaboration between the re-projected co-segmented image and the reference view Cross-entropy loss between segmented images;
语义一致性损失获取:根据所述交叉熵损失获取语义一致性损失。Semantic consistency loss acquisition: The semantic consistency loss is obtained according to the cross-entropy loss.
在一个具体的实施例中,所述深度估计处理具体包括:In a specific embodiment, the depth estimation process specifically includes:
基于深度估计网络对所述多视角图像进行深度估计,获取深度图像;Perform depth estimation on the multi-view image based on a depth estimation network to obtain a depth image;
获取参考视角和非参考视角,将所述非参考视角上的深度图像进行重建得到重投影视图像,并根据所述重投影视图像计算回归损失;obtaining a reference viewing angle and a non-reference viewing angle, reconstructing the depth image on the non-reference viewing angle to obtain a reprojected view image, and calculating a regression loss according to the reprojected view image;
根据所述回归损失获取光度一致性损失。The photometric consistency loss is obtained from the regression loss.
在一个具体的实施例中,所述数据增强处理具体包括:In a specific embodiment, the data enhancement process specifically includes:
采用不同的数据增强策略对所述多视角图像对进行数据增强;Using different data enhancement strategies to perform data enhancement on the multi-view image pair;
以所述深度图像为伪标签对数据增强后的多视角图像对进行监督,获取不同所述数据增强策略下的数据损失;Using the depth image as a pseudo-label to supervise the data-enhanced multi-view image pair to obtain data loss under different data enhancement strategies;
根据所述数据损失获取数据增强一致性损失。A data enhancement consistency loss is obtained according to the data loss.
在一个具体的实施例中,所述图像对获取具体包括:In a specific embodiment, the acquisition of the image pair specifically includes:
获取输入数据,所述输入数据包括图像或视频;obtaining input data, the input data including images or videos;
判断所述输入数据是否为图像:若是,则在所述输入数据中选取多视角图像;若否,则将所述输入数据转换为多视角图像;Determine whether the input data is an image: if so, select a multi-view image from the input data; if not, convert the input data into a multi-view image;
在所述多视角图像中获取视角相似且具有相同区域的多视角图像对;obtaining a pair of multi-view images with similar viewing angles and having the same area in the multi-view images;
对所述多视角图像对进行图像预处理。Image preprocessing is performed on the multi-view image pair.
在一个具体的实施例中,所述“在所述多视角图像中获取视角相似且具有相同区域的多视角图像对”还包括:In a specific embodiment, the "acquiring a pair of multi-view images with similar viewing angles and having the same area in the multi-view images" further includes:
通过二维尺度不变图像特征对所述多视角图像进行特征匹配,获取图像特征的匹配程度;Perform feature matching on the multi-view image by using two-dimensional scale-invariant image features to obtain the matching degree of image features;
根据所述匹配程度计算图像之间的视角重合程度,并对所述视角重合程度进行排序,获取视角相似且具有相同区域的多视角图像对。The degree of overlapping of viewing angles between the images is calculated according to the matching degree, and the overlapping degrees of viewing angles are sorted to obtain a pair of multi-view images with similar viewing angles and having the same area.
在一个具体的实施例中,所述协同分割图像获取具体包括:In a specific embodiment, the cooperatively segmented image acquisition specifically includes:
通过卷积神经网络对所述多视角图像对中的每张图像进行特征提取,获取每个视角的特征图张量,所有视角的特征图张量构成特征图矩阵;Perform feature extraction on each image in the multi-view image pair by using a convolutional neural network to obtain a feature map tensor of each view, and the feature map tensors of all views form a feature map matrix;
通过链式迭代式对所述特征图矩阵进行非负矩阵分解,求得第一非负矩阵和第二非负矩阵;Perform non-negative matrix decomposition on the feature map matrix through chain iteration to obtain a first non-negative matrix and a second non-negative matrix;
将所述第一非负矩阵转换为与图像维度对应的格式,获取协同分割图像。Convert the first non-negative matrix into a format corresponding to the image dimension to obtain a collaboratively segmented image.
在一个具体的实施例中,所述特征图矩阵的表达式为:In a specific embodiment, the expression of the feature map matrix is:
$$A \in \mathbb{R}^{V\times H\times W\times C}$$
The expressions of the first non-negative matrix and the second non-negative matrix are respectively:
$$P \in \mathbb{R}^{V\times H\times W\times K},\qquad Q \in \mathbb{R}^{C\times K}$$
The expression of the co-segmentation image is:
$$S \in \mathbb{R}^{V\times H\times W\times K}$$
where A is the feature map matrix, S is the co-segmentation image, P is the first non-negative matrix, Q is the second non-negative matrix, V is the total number of views, H and W are the height and width of the image, C is the number of channels of the convolutional layers in the convolutional neural network, K is the number of columns of the first non-negative matrix P in the non-negative matrix factorization and is also the number of rows of the second non-negative matrix Q, and R denotes the real numbers.
在一个具体的实施例中,所述交叉熵损失获取具体包括:In a specific embodiment, the obtaining of the cross-entropy loss specifically includes:
在所有视角中选取一个参考视角,除所述参考视角以外的视角为非参考视角,获取所述参考视角下的协同分割图像和所述非参考视角下的协同分割图像;Selecting a reference viewing angle from all viewing angles, and viewing angles other than the reference viewing angle are non-reference viewing angles, and obtaining a collaboratively segmented image under the reference viewing angle and a collaboratively segmented image under the non-reference viewing angle;
根据单应性公式计算同一位置的像素分别在所述参考视角下与所述非参考视角下的对应关系;Calculate the corresponding relationship between the pixels at the same position under the reference viewing angle and the non-reference viewing angle respectively according to the homography formula;
基于单应性映射公式和双线性插值策略,将所述非参考视角下的协同分割图像投影到参考视角下进行重建,获得重投影协同分割图像;Based on the homography mapping formula and the bilinear interpolation strategy, the collaboratively segmented image under the non-reference viewing angle is projected to the reference viewing angle for reconstruction, and the reprojected collaboratively segmented image is obtained;
计算所述重投影协同分割图像与所述参考视角下的协同分割图像之间的交叉熵损失。A cross-entropy loss between the reprojected co-segmented image and the co-segmented image in the reference view is calculated.
在一个具体的实施例中,所述参考视角下的协同分割图像和所述非参考视角下的协同分割图像的表达式分别为:In a specific embodiment, the expressions of the co-segmented image under the reference view and the co-segmented image under the non-reference view are respectively:
$$S_1 \in \mathbb{R}^{H\times W\times K},\qquad S_i \in \mathbb{R}^{H\times W\times K}$$
where S_1 is the co-segmentation image under the reference view, S_i is the co-segmentation image under a non-reference view, V is the total number of views, H and W are the height and width of the image, K is the number of columns of the first non-negative matrix P and also the number of rows of the second non-negative matrix Q, i denotes a non-reference view, and 2 ≤ i ≤ V;
The correspondence relationship is expressed as:
$$\hat{p}_j^{\,i} = K_i\,T_i\,T_1^{-1}\,K_1^{-1}\,\big(D(p_j)\,p_j\big)$$
The expression of the reprojected co-segmentation image $\hat{S}_i$ is:
$$\hat{S}_{i}(p_j) = S_i\big(\hat{p}_j^{\,i}\big)$$
where p_j is the position of a pixel under the reference view, $\hat{p}_j^{\,i}$ is the position of the corresponding pixel under the non-reference view, $K_i$ and $T_i$ denote the intrinsic and extrinsic matrices of view i, j denotes the index of a pixel in the image, D denotes the depth map predicted by the network, and $\hat{S}_i$ is the reprojected co-segmentation image.
在一个具体的实施例中,所述交叉熵损失表达式为:In a specific embodiment, the expression of the cross entropy loss is:
$$f(S_{1,j}) = \mathrm{onehot}\big(\mathrm{argmax}(S_{1,j})\big)$$
The expression of the semantic consistency error is:
$$L_{SC} = \sum_{i=2}^{V} \frac{1}{\sum_{j} M_{i,j}} \sum_{j\in\mathbb{N},\,1\le j\le H\times W} M_{i,j}\cdot\Big(-f(S_{1,j})\cdot\log \hat{S}_{i,j}\Big)$$
where f(S_{1,j}) is the cross-entropy pseudo-label term, L_SC is the semantic consistency error, M_i denotes the valid region mapped from the non-reference view onto the reference view by homography, N is the set of natural numbers, j denotes the index of a pixel in the image, H and W are the height and width of the image, $\hat{S}_i$ is the reprojected co-segmentation image, S_1 is the co-segmentation image under the reference view, and i denotes a non-reference view.
在一个具体的实施例中,所述数据增强策略包括随机遮挡掩码、伽马校正、颜色扰动和随机噪声。In a specific embodiment, the data augmentation strategy includes random occlusion masks, gamma correction, color perturbation, and random noise.
在一个具体的实施例中,所述数据增强一致性损失的表达式为:In a specific embodiment, the expression of the data enhancement consistency loss is:
$$L_{DA} = \frac{1}{\sum_{j} M_{mask,j}}\,\Big\|\, M_{mask}\odot\big(\hat{D} - D\big)\,\Big\|_2$$
where L_DA is the data augmentation consistency loss, the data augmentation function is $\tau_\theta = \tau_{c}\circ\tau_{\gamma}\circ\tau_{mask}$, $\tau_{mask}$ is the random occlusion mask, $\tau_{\gamma}$ is the gamma correction, $\tau_{c}$ is the color perturbation and random noise, $M_{mask}$ denotes the binary non-occluded valid-region mask associated with the random occlusion mask $\tau_{mask}$, $\hat{D}$ is the depth map predicted from the augmented images, and D is the depth map.
一种基于协同分割与数据增强的自监督三维重建系统,包括:A self-supervised 3D reconstruction system based on collaborative segmentation and data augmentation, comprising:
输入单元,用于获取输入数据,根据所述输入数据获取具有重合区域且视角相似的多视角图像对;an input unit for acquiring input data, and acquiring pairs of multi-view images with overlapping regions and similar viewing angles according to the input data;
深度处理单元,用于通过对所述多视角图像对进行深度估计处理,获取光度一致性损失,a depth processing unit, configured to obtain the loss of luminosity consistency by performing depth estimation processing on the multi-view image pair,
双支处理单元,包括协同分割单元和数据增强单元,所述协同分割单元和所述数据增强单元并行运行,协同分割单元用于通过对所述多视角图像对进行协同分割处理,获取语义一致性损失;数据增强单元用于通过对所述多视角图像对进行数据增强处理,获取数据增强一致性损失;A dual-branch processing unit includes a collaborative segmentation unit and a data enhancement unit. The collaborative segmentation unit and the data enhancement unit run in parallel, and the collaborative segmentation unit is used to obtain semantic consistency by performing collaborative segmentation processing on the multi-view image pair. loss; the data enhancement unit is configured to obtain data enhancement consistency loss by performing data enhancement processing on the multi-view image pair;
损失函数构建单元,用于根据所述光度一致性损失、所述语义一致性 损失和所述数据增强一致性损失构建损失函数;a loss function construction unit, configured to construct a loss function according to the photometric consistency loss, the semantic consistency loss and the data enhancement consistency loss;
输出单元,用于根据所述损失函数构建并训练神经网络模型,基于所述神经网络模型获取所述输入数据的三维模型。An output unit, configured to construct and train a neural network model according to the loss function, and obtain a three-dimensional model of the input data based on the neural network model.
在一个具体的实施例中,所述输入单元包括:In a specific embodiment, the input unit includes:
输入数据获取单元,用于获取输入数据,所述输入数据包括图像或视频;an input data acquisition unit for acquiring input data, the input data including images or videos;
转换单元,用于判断所述输入数据是否为图像:若是,则在所述输入数据中选取多视角图像;若否,则将所述输入数据转换为多视角图像;A conversion unit, configured to judge whether the input data is an image: if so, select a multi-view image from the input data; if not, convert the input data into a multi-view image;
筛选单元,用于根据所述多视角图像获取视角相似且具有相同区域的多视角图像对;a screening unit, configured to acquire pairs of multi-view images with similar viewing angles and having the same area according to the multi-view images;
预处理单元,用于对所述多视角图像对进行图像预处理。A preprocessing unit, configured to perform image preprocessing on the multi-view image pair.
在一个具体的实施例中,所述协同分割单元包括:In a specific embodiment, the cooperative segmentation unit includes:
分割图像获取单元,用于通过非负矩阵对所述多视角图像对进行协同分割,获取协同分割图像;a segmented image acquisition unit, configured to perform collaborative segmentation on the multi-view image pair through a non-negative matrix to obtain a collaboratively segmented image;
交叉熵损失获取单元,用于获取参考视角和非参考视角,通过单应性映射将所述非参考视角上的协同分割图像进行重建得到重投影协同分割图像,并计算所述重投影协同分割图像与所述参考视角上的协同分割图像之间的交叉熵损失;A cross-entropy loss acquisition unit is used to acquire a reference view and a non-reference view, reconstruct the co-segmented image on the non-reference view through homography to obtain a re-projected co-segmented image, and calculate the re-projected co-segmented image the cross-entropy loss with the co-segmented image on the reference view;
语义损失获取单元,用于根据所述交叉熵损失获取语义一致性损失。The semantic loss obtaining unit is configured to obtain the semantic consistency loss according to the cross entropy loss.
在一个具体的实施例中,所述深度处理单元包括:In a specific embodiment, the deep processing unit includes:
深度图像获取单元,用于基于深度估计网络对所述多视角图像进行深度估计,获取深度图像;a depth image acquisition unit, configured to perform depth estimation on the multi-view image based on a depth estimation network to obtain a depth image;
回归损失获取单元,用于获取参考视角和非参考视角,通过单应性映射将所述非参考视角上的深度图像进行重建得到重投影视图像,并根据所述重投影视图像计算回归损失;a regression loss acquisition unit, configured to acquire a reference view angle and a non-reference view angle, reconstruct the depth image on the non-reference view angle through homography to obtain a reprojected view image, and calculate a regression loss according to the reprojection view image;
光度损失获取单元,用于根据所述回归损失获取光度一致性损失。The photometric loss obtaining unit is configured to obtain the photometric consistency loss according to the regression loss.
在一个具体的实施例中,所述数据增强单元包括:In a specific embodiment, the data enhancement unit includes:
数据处理单元,用于采用不同的数据增强策略对所述多视角图像对进行数据增强处理;a data processing unit, configured to perform data enhancement processing on the multi-view image pair by adopting different data enhancement strategies;
数据损失获取单元,用于以所述深度图像为伪标签对所述数据增强处理后的多视角损失图像对进行监督,获取不同所述数据增强策略下的数据损失;a data loss obtaining unit, configured to supervise the multi-view loss image pair after the data enhancement processing by using the depth image as a pseudo-label, and obtain the data loss under different data enhancement strategies;
数据一致性损失获取单元,用于根据所述数据损失获取数据增强一致性损失。The data consistency loss obtaining unit is configured to obtain the data enhancement consistency loss according to the data loss.
本发明具有如下有益效果:The present invention has the following beneficial effects:
本发明提供了基于协同分割与数据增强的自监督三维重建方法及系统。针对亮度一致性歧义问题,引入抽象的语义线索以及在自监督信号中嵌入数据增强机制,增强了自监督信号在噪声扰动下的可靠性。The present invention provides a self-supervised three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement. Aiming at the brightness consistency ambiguity problem, abstract semantic clues are introduced and a data enhancement mechanism is embedded in the self-supervised signal, which enhances the reliability of the self-supervised signal under noise disturbance.
本发明提出的自监督训练方法超越了传统的无监督方法,并能与一些领先的有监督方法取得相当的效果。The self-supervised training method proposed by the present invention surpasses traditional unsupervised methods and can achieve comparable results with some leading supervised methods.
基于协同分割的语义一致性损失,动态地从多视图对中通过聚类挖掘出共有语义信息部件。Based on the semantic consistency loss of collaborative segmentation, the shared semantic information components are dynamically mined from multi-view pairs through clustering.
The data augmentation consistency loss extends the self-supervised branch into a dual-stream structure: the prediction of the standard branch is used as a pseudo-label to supervise the prediction of the data augmentation branch. This disentangles the data-augmentation contrast consistency from the brightness consistency hypothesis and handles the two separately, so that a large amount of data augmentation can be introduced into the self-supervised signal to enrich the variations in the training set.
整个流程无需任何标签数据,不依赖于真值标注,而是从数据本身挖掘出有效信息实现网络的训练,极大节约了成本,缩短了重建进程。The whole process does not require any label data and does not rely on the true value labeling. Instead, effective information is mined from the data itself to implement network training, which greatly saves costs and shortens the reconstruction process.
将深度预测、协同分割以及数据增强融合到一起,在解决了显存溢出问题地基础上,提升了自监督信号的精度,使本实施例具备更好的泛化性。Integrating depth prediction, collaborative segmentation and data enhancement together, on the basis of solving the problem of video memory overflow, the accuracy of the self-supervised signal is improved, so that this embodiment has better generalization.
为使本发明的上述目的、特征和优点能更明显易懂,下文特举较佳实施例,并配合所附附图,作详细说明如下。In order to make the above-mentioned objects, features and advantages of the present invention more obvious and easy to understand, preferred embodiments are given below, and are described in detail as follows in conjunction with the accompanying drawings.
附图说明Description of drawings
为了更清楚地说明本发明实施例的技术方案,下面将对实施例中所需要使用的附图作简单地介绍,应当理解,以下附图仅示出了本发明的某些实施例,因此不应被看作是对范围的限定,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他相关的附图。In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the following briefly introduces the accompanying drawings used in the embodiments. It should be understood that the following drawings only show some embodiments of the present invention, and therefore do not It should be regarded as a limitation of the scope, and for those of ordinary skill in the art, other related drawings can also be obtained according to these drawings without any creative effort.
图1是本发明实施例1的自监督三维重建方法流程图;1 is a flowchart of a self-supervised three-dimensional reconstruction method according to Embodiment 1 of the present invention;
图2是本发明实施例1的输入数据处理流程图;Fig. 2 is the input data processing flow chart of Embodiment 1 of the present invention;
图3是本发明实施例1的深度估计处理流程图;3 is a flowchart of depth estimation processing in Embodiment 1 of the present invention;
图4是本发明实施例1的深度估计处理原理图;4 is a schematic diagram of a depth estimation process according to Embodiment 1 of the present invention;
图5是本发明实施例1的协同分割处理流程图;5 is a flowchart of the collaborative segmentation process in Embodiment 1 of the present invention;
图6是本发明实施例1的协同分割处理原理图;6 is a schematic diagram of a collaborative segmentation process in Embodiment 1 of the present invention;
图7是本发明实施例1的数据增强处理流程图;Fig. 7 is the data enhancement processing flow chart of Embodiment 1 of the present invention;
图8是本发明实施例1的数据增强处理原理图;8 is a schematic diagram of a data enhancement process according to Embodiment 1 of the present invention;
图9是本发明实施例1的实验检测结果图;Fig. 9 is the experimental detection result diagram of the embodiment of the present invention 1;
图10是本发明实施例1的一个三维重建结果图;10 is a three-dimensional reconstruction result diagram of Embodiment 1 of the present invention;
图11是本发明实施例1的另一个三维重建结果图;11 is another three-dimensional reconstruction result diagram of Embodiment 1 of the present invention;
图12是本发明实施例2的系统模块图;12 is a system module diagram of Embodiment 2 of the present invention;
图13是本发明实施例2的系统具体结构图。FIG. 13 is a specific structural diagram of a system according to Embodiment 2 of the present invention.
附图标记:Reference number:
1-输入单元;2-深度处理单元;3-双支处理单元;4-损失函数构建单元;5-输出单元;11-输入数据获取单元;12-转换单元;13-筛选单元;14-预处理单元;21-深度图像获取单元;22-回归损失获取单元;23-光度损失 获取单元;31-协同分割单元;311-分割图像获取单元;312-交叉熵损失获取单元;313-语义损失获取单元;32-数据增强单元;321-数据处理单元;322-数据损失获取单元;323-数据一致性损失获取单元。1-input unit; 2-depth processing unit; 3-dual branch processing unit; 4-loss function construction unit; 5-output unit; 11-input data acquisition unit; 12-transformation unit; 13-screening unit; 14-pre-processing processing unit; 21-depth image acquisition unit; 22-regression loss acquisition unit; 23-luminosity loss acquisition unit; 31-cooperative segmentation unit; 311-segmented image acquisition unit; 312-cross entropy loss acquisition unit; 313-semantic loss acquisition unit; 32 - data enhancement unit; 321 - data processing unit; 322 - data loss acquisition unit; 323 - data consistency loss acquisition unit.
具体实施方式Detailed ways
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅是本发明的一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, but not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.
现有的自监督三维重建算法中往往都是直接将不同视角的图像通过预测的深度图投影到参考视角,如果深度图足够可靠那么重投影的重建图像应该与实际的参考视角的原图像尽可能相似。在这个过程中,默认整个场景都服从于亮度一致性假设(Color constancy hypothesis),即:不同视角的匹配点具有相同的颜色。但是,在现实场景下,相机所拍摄的多视角图像不可避免地会存在各种干扰因素,如光照、噪声等等,导致不同视角的匹配点颜色分布有差异。然而在这种情况下,亮度一致性假设(Color constancy hypothesis)就不再有效,从而导致自监督信号本身就不再有效。最后,整个训练过程中,不可靠的自监督信号无法起到很好的监督作用,导致自监督方法训练出来的模型跟有监督方法相比不可避免地具有较大差异。这个问题被称为亮度一致性歧义问题。如果只进行常规训练,由于亮度一致性歧义,会导致模型在边缘区域模糊,且在很多区域都存在过平滑的问题。只有在数据量很大的情况下,或者相对比较理想的场景下,常规的自监督训练才可能不受到亮度一致性歧义问题的影响,并取得相当的效果。Existing self-supervised 3D reconstruction algorithms often directly project images from different perspectives to the reference perspective through the predicted depth map. If the depth map is reliable enough, the reprojected reconstructed image should be as close as possible to the original image of the actual reference perspective. resemblance. In this process, by default the entire scene is subject to the Color constancy hypothesis, that is, matching points from different viewing angles have the same color. However, in a real scene, the multi-view images captured by the camera will inevitably have various interference factors, such as illumination, noise, etc., which lead to differences in the color distribution of matching points of different viewing angles. In this case, however, the Color constancy hypothesis is no longer valid, resulting in the self-supervision signal itself no longer valid. Finally, in the whole training process, unreliable self-supervised signals cannot play a good role in supervision, resulting in models trained by self-supervised methods that are inevitably different from supervised methods. This problem is called the luminance consistency ambiguity problem. If only regular training is performed, due to the ambiguity of brightness consistency, the model will be blurred in edge areas, and there will be over-smoothing problems in many areas. Only in the case of a large amount of data, or in relatively ideal scenarios, can conventional self-supervised training not be affected by the brightness consistency ambiguity problem and achieve comparable results.
亮度一致性歧义问题是无/自监督三维重建方法中的核心问题。因此, 只有解决亮度一致性歧义问题,才可突破无/自监督三维重建方法的限制。The luminance consistency ambiguity problem is the core problem in unsupervised/self-supervised 3D reconstruction methods. Therefore, the limitation of un/self-supervised 3D reconstruction methods can be overcome only by solving the problem of brightness consistency ambiguity.
本发明针对亮度一致性歧义问题,提出了一种基于协同分割与数据增强的自监督三维重建方法及系统,通过引入抽象的语义线索以及在自监督信号中嵌入数据增强机制以增强自监督信号在噪声扰动下的可靠性,既能解决传统三维重建方法存在的细节损失、容易收到噪声光照干扰、过度依赖训练数据等问题,也能解决常规无/自监督三维重建方法的缺陷,超越了传统的无/自监督方法并能与一些高效的有监督方法取得相当的效果,且整个过程无需任何标注。Aiming at the problem of brightness consistency ambiguity, the present invention proposes a self-supervised three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement. Reliability under noise disturbance can not only solve the problems of traditional 3D reconstruction methods such as loss of details, easy to receive noise and light interference, over-reliance on training data, etc., but also solve the defects of conventional un/self-supervised 3D reconstruction methods, surpassing the traditional The unsupervised/self-supervised method can achieve comparable results with some efficient supervised methods, and the whole process does not require any annotation.
实验证明,本发明提供的自监督三维重建方法,在DTU数据集上超过了传统的无监督三维重建方法,并且能够实现与最先进的有监督方法相当的效果。此外,在不做任何微调的前提下,直接将本发明最终获取的无监督训练的模型应用在Tanks&Temples数据集上,也能超过传统的无监督方法。由于Tanks&Temples数据集本身包含了大量特殊的自然场景的光照变化,从侧面说明了本发明相比其他无监督方法具有较好的泛化性。Experiments show that the self-supervised 3D reconstruction method provided by the present invention surpasses the traditional unsupervised 3D reconstruction method on the DTU data set, and can achieve an effect comparable to the state-of-the-art supervised method. In addition, without any fine-tuning, the unsupervised training model finally obtained by the present invention can be directly applied to the Tanks&Temples dataset, which can also surpass the traditional unsupervised method. Since the Tanks&Temples dataset itself contains a large number of illumination changes of special natural scenes, it shows from the side that the present invention has better generalization than other unsupervised methods.
It should be noted that when collecting sample data, the present invention stays as close as possible to the lighting conditions of real scenes, reproduces the noise interference and color perturbations of various scenes, and simulates as many kinds of natural scenes as possible, so the samples are highly representative. The present invention can be applied to a wide range of generalized scenarios and, compared with conventional self-supervised 3D reconstruction methods, has stronger pertinence and a wider scope of application.
It should be noted that the reference views in this application, including the reference views used in the depth estimation processing, the co-segmentation processing and the data augmentation processing, are the same. Generally speaking, for N multi-view images, a multi-view pair is constructed once for each view; the view around which a multi-view pair is constructed is the reference view of that pair. In the end there are N multi-view pairs.
实施例1Example 1
本实施例提出了一种基于协同分割与数据增强的自监督三维重建方法,如说明书附图1-11所示。流程步骤如说明书附图1,具体方案如下:This embodiment proposes a self-supervised three-dimensional reconstruction method based on collaborative segmentation and data enhancement, as shown in Figures 1-11 of the specification. The process steps are as shown in accompanying drawing 1 of the description, and the specific scheme is as follows:
S1、获取输入数据,根据输入数据获取具有重合区域且视角相似的多视角图像对;S1, obtain input data, and obtain multi-view image pairs with overlapping regions and similar viewing angles according to the input data;
S2、通过对多视角图像对进行深度估计处理,获取光度一致性损失;S2. Obtain the loss of photometric consistency by performing depth estimation processing on the multi-view image pair;
S3、通过对多视角图像对进行协同分割处理,获取语义一致性损失,通过对多视角图像对进行数据增强处理,获取数据增强一致性损失,协同分割处理和数据增强处理并行运行;S3. Obtain the semantic consistency loss by performing collaborative segmentation processing on the multi-view image pair, obtain the data enhancement consistency loss by performing data enhancement processing on the multi-view image pair, and run the collaborative segmentation processing and the data enhancement processing in parallel;
S4、根据光度一致性损失、语义一致性损失和数据增强一致性损失构建损失函数;S4. Construct a loss function according to the loss of photometric consistency, the loss of semantic consistency and the loss of data enhancement consistency;
S5、根据损失函数构建并训练神经网络模型,基于神经网络模型获取输入数据的三维模型。S5. Construct and train a neural network model according to the loss function, and obtain a three-dimensional model of the input data based on the neural network model.
在本实施例中,步骤S1获取输入数据,根据输入数据获取具有重合区域且视角相似的多视角图像对。步骤S1流程如说明书附图2所示,具体包括:In this embodiment, step S1 acquires input data, and acquires multi-view image pairs having overlapping regions and similar viewing angles according to the input data. The process of step S1 is shown in Figure 2 of the specification, and specifically includes:
S11、获取输入数据,输入数据包括图像或视频;S11. Obtain input data, where the input data includes images or videos;
S12、判断输入数据是否为图像:若是,则在输入数据中选取多视角图像;若否,则将输入数据转换为多视角图像;S12, determine whether the input data is an image: if so, select a multi-view image from the input data; if not, convert the input data into a multi-view image;
S13、根据多视角图像获取视角相似且具有相同区域的多视角图像对;S13. Acquire pairs of multi-view images with similar viewing angles and having the same area according to the multi-view images;
S14、对多视角图像对进行图像预处理。S14. Perform image preprocessing on the multi-view image pair.
具体地,原始多视角图像的数据采集,可以通过任意相机在各种不同视角下拍摄图像或是直接在相机移动过程中拍摄一段视频完成,本实施例的输入数据既可以是图像或视频,也可以是图像结合视频。如果是图像,仅需要从输入数据中提取多视角图像,从多视角图像中筛选出视角相似且具有相同区域的多视角图像对,最后通过基本的图像预处理如图像滤波等技术增强图像质量即可;如果是视频,则需要先将视频转换成多视角图像, 从多视角图像中筛选出视角相似且具有相同区域的多视角图像对,再进行图像预处理。Specifically, the data collection of the original multi-view image can be completed by capturing images with any camera at various viewing angles or directly capturing a video while the camera is moving. The input data in this embodiment may be images or videos, or It can be an image combined with a video. If it is an image, it is only necessary to extract multi-view images from the input data, filter out multi-view image pairs with similar viewing angles and the same area from the multi-view images, and finally enhance the image quality through basic image preprocessing techniques such as image filtering. Yes; if it is a video, you need to convert the video into a multi-view image first, and filter out multi-view image pairs with similar viewing angles and the same area from the multi-view images, and then perform image preprocessing.
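As a purely illustrative aid (not part of the claimed subject matter), the following minimal Python sketch shows one way the input handling described above could be organized; the use of OpenCV, the frame stride, the file pattern and the function name are assumptions of the example.

```python
import glob
import cv2  # assumed dependency; any image/video library would do

def load_views(input_path, is_video, frame_stride=10):
    """Collect candidate multi-view images from still images or from a video."""
    frames = []
    if is_video:
        cap = cv2.VideoCapture(input_path)
        idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % frame_stride == 0:   # sample frames so neighbouring views differ
                frames.append(frame)
            idx += 1
        cap.release()
    else:
        for path in sorted(glob.glob(input_path + "/*.jpg")):
            frames.append(cv2.imread(path))
    # basic preprocessing, e.g. mild smoothing, as mentioned above
    return [cv2.GaussianBlur(f, (3, 3), 0) for f in frames]
```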
特别地,步骤S13选取多视角图像对具体包括:通过二维尺度不变图像特征对所述多视角图像进行特征匹配,获取像素点的匹配信息和图像特征的匹配程度;In particular, the step S13 of selecting a pair of multi-view images specifically includes: performing feature matching on the multi-view images by using two-dimensional scale-invariant image features, and obtaining matching information of pixels and matching degrees of image features;
根据所述匹配信息获取相机外参矩阵,根据所述匹配程度计算图像之间的视角重合程度,并对所述视角重合程度进行排序,获取与每个视角接近的其它视角。The camera extrinsic parameter matrix is obtained according to the matching information, the degree of overlapping of viewing angles between the images is calculated according to the degree of matching, and the overlapping degrees of viewing angles are sorted to obtain other viewing angles that are close to each viewing angle.
Specifically, after the multi-view images are acquired, feature matching is performed between every pair of images using two-dimensional scale-invariant image features such as SIFT, ORB and SURF. Relying on the two-dimensional pixel-level matching information, the bundle adjustment problem among all cameras is solved and the relative pose relationship between different cameras, i.e. the camera extrinsic matrices, is computed. In addition, the degree of view overlap between every pair of images is calculated according to the matching degree of the image feature descriptors. The views are sorted by the degree of overlap, and for each view the top 10 views closest to it among all remaining views are obtained. In this way, the multi-view images of N views can be divided into N groups of multi-view image pairs for the subsequent stereo matching process.
在本实施例中,多视角图像对一般包括3-7张多视角图像,选取视角相似且具有重合区域的多视角图像对可方便后续的特征匹配。需要说明的是,如果视角差异过大,重合区域过小,后续流程找匹配点时有效区域会非常小,影响流程的进行。In this embodiment, the multi-view image pair generally includes 3-7 multi-view images, and selecting multi-view image pairs with similar viewing angles and overlapping regions can facilitate subsequent feature matching. It should be noted that if the viewing angle difference is too large and the overlapping area is too small, the effective area will be very small when the subsequent process finds matching points, which will affect the progress of the process.
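The view-pair selection described above can be illustrated with the following sketch; SIFT via OpenCV, the Lowe ratio threshold and the choice of ten neighbours follow the description above, while the helper names are assumptions of the example.

```python
import numpy as np
import cv2  # SIFT is available in recent OpenCV builds; using it is an assumption of this sketch

def pairwise_overlap_scores(gray_images, ratio=0.7):
    """Score view overlap by the number of SIFT matches passing the Lowe ratio test."""
    sift = cv2.SIFT_create()
    feats = [sift.detectAndCompute(img, None) for img in gray_images]
    n = len(gray_images)
    scores = np.zeros((n, n))
    matcher = cv2.BFMatcher()
    for a in range(n):
        for b in range(a + 1, n):
            if feats[a][1] is None or feats[b][1] is None:
                continue
            knn = matcher.knnMatch(feats[a][1], feats[b][1], k=2)
            good = [p for p in knn if len(p) == 2 and p[0].distance < ratio * p[1].distance]
            scores[a, b] = scores[b, a] = len(good)
    return scores

def select_view_pairs(scores, num_neighbors=10):
    """For each reference view, keep the views with the largest overlap score."""
    order = np.argsort(-scores, axis=1)
    return {ref: [int(v) for v in order[ref] if v != ref][:num_neighbors]
            for ref in range(len(scores))}
```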
在本实施例中,S2通过对多视角图像对进行深度估计处理,获取光度一致性损失,具体流程如说明书附图3所示,包括:In this embodiment, S2 obtains the loss of luminosity consistency by performing depth estimation processing on the multi-view image pair, and the specific process is shown in FIG. 3 of the specification, including:
S21、基于深度估计网络对多视角图像进行深度估计,获取深度图像;S21. Perform depth estimation on a multi-view image based on a depth estimation network to obtain a depth image;
S22、获取参考视角和非参考视角,通过单应性映射将非参考视角上的深度图像进行重建得到重投影视图像,并根据重投影视图像计算回归损失;S22, obtaining a reference viewing angle and a non-reference viewing angle, reconstructing the depth image on the non-reference viewing angle through homography to obtain a reprojected viewing image, and calculating a regression loss according to the reprojecting viewing image;
S23、根据回归损失获取光度一致性损失。S23. Obtain the photometric consistency loss according to the regression loss.
深度估计处理是现有三维重建方法中常用的技术手段。具体流程包括:将多视角图像对和参考视图输入到深度估计网络进行深度估计,可获得深度图,将深度图和多视角图像对进行单应性映射,对非参考视角上的深度图像进行重建得到重投影视图像,通过计算重投影视图和参考视图之间的差异可获取回归损失,即L2损失,基于L2损失获取光度一致性误差。具体原理如说明书附图4所示。Depth estimation processing is a commonly used technical means in existing 3D reconstruction methods. The specific process includes: inputting the multi-view image pair and the reference view to the depth estimation network for depth estimation, obtaining a depth map, performing homography mapping between the depth map and the multi-view image pair, and reconstructing the depth image from the non-reference view. The reprojected view image is obtained, and the regression loss, ie the L2 loss, can be obtained by calculating the difference between the reprojected view and the reference view, and the photometric consistency error is obtained based on the L2 loss. The specific principle is shown in Figure 4 of the description.
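For illustration, a minimal NumPy sketch of the photometric-consistency branch follows. It assumes pinhole intrinsics and 4x4 world-to-camera extrinsics, and it uses nearest-neighbour sampling to stay short, whereas the method described above relies on differentiable bilinear interpolation; all function names are assumptions of the example.

```python
import numpy as np

def warp_to_reference(src_img, depth_ref, K_ref, K_src, T_ref, T_src):
    """Reproject a source-view image into the reference view using the predicted
    reference-view depth map (nearest-neighbour sampling for brevity)."""
    H, W = depth_ref.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=0).reshape(3, -1).astype(np.float64)
    cam_ref = np.linalg.inv(K_ref) @ pix * depth_ref.reshape(1, -1)   # back-project to 3D
    rel = T_src @ np.linalg.inv(T_ref)                                # reference cam -> source cam
    cam_src = rel[:3, :3] @ cam_ref + rel[:3, 3:4]
    proj = K_src @ cam_src
    u = proj[0] / np.clip(proj[2], 1e-6, None)
    v = proj[1] / np.clip(proj[2], 1e-6, None)
    valid = (u >= 0) & (u <= W - 1) & (v >= 0) & (v <= H - 1) & (proj[2] > 0)
    warped = np.zeros_like(src_img, dtype=np.float64)
    rows, cols = ys.reshape(-1)[valid], xs.reshape(-1)[valid]
    warped[rows, cols] = src_img[np.round(v[valid]).astype(int), np.round(u[valid]).astype(int)]
    return warped, valid.reshape(H, W)

def photometric_loss(ref_img, warped_img, valid_mask):
    """Masked L2 photometric consistency between the reference image and the warped view."""
    diff = (ref_img - warped_img) ** 2
    return float(diff[valid_mask].mean()) if valid_mask.any() else 0.0
```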
在本实施例中,S3通过对多视角图像对进行协同分割处理,获取语义一致性损失,通过对多视角图像对进行数据增强处理,获取数据增强一致性损失,协同分割处理和数据增强处理并行运行。步骤S3是本实施例的核心步骤,通过协同分割处理和数据增强处理两个分支并行运行,获取语义一致性损失和数据增强一致性损失。In this embodiment, S3 obtains the semantic consistency loss by performing collaborative segmentation processing on the multi-view image pairs, and obtains the data enhancement consistency loss by performing data enhancement processing on the multi-view image pairs, and the collaborative segmentation processing and the data enhancement processing are performed in parallel run. Step S3 is the core step of this embodiment. The two branches of the collaborative segmentation processing and the data enhancement processing are run in parallel to obtain the semantic consistency loss and the data enhancement consistency loss.
Among them, the semantic consistency loss based on co-segmentation dynamically mines the shared semantic components from multi-view pairs through clustering. It does not require ground-truth labels, can be generalized to any scene, and performs unsupervised clustering of the information shared by the multiple views of a scene without relying on human-defined semantic categories. In contrast, existing schemes based on semantic consistency usually need a large amount of manual annotation to obtain semantic labels, which is very costly; in addition, these methods are restricted to specific scenes and specific human-defined semantic categories and cannot be applied to arbitrary scenes. The specific flow of the co-segmentation processing is shown in Figure 5 of the description and includes:
S311、通过非负矩阵对多视角图像对进行协同分割,获取协同分割图像;S311. Perform collaborative segmentation on the multi-view image pair by using a non-negative matrix to obtain a collaboratively segmented image;
S312、获取参考视角和非参考视角,通过单应性映射将非参考视角上的协同分割图像进行重建得到重投影协同分割图像,并计算重投影协同分割图像与参考视角上的协同分割图像之间的交叉熵损失;S312: Obtain a reference perspective and a non-reference perspective, reconstruct the collaboratively segmented image on the non-reference perspective through homography to obtain a reprojected collaboratively segmented image, and calculate the difference between the reprojected collaboratively segmented image and the collaboratively segmented image on the reference perspective The cross entropy loss of ;
S313、根据交叉熵损失获取语义一致性损失。S313. Obtain the semantic consistency loss according to the cross-entropy loss.
协同分割处理流程包括:将参考视图和多视角图像对输入到预训练的VGG网络,接着进行非负矩阵分解,获取参考视角下的协同分割图像和非 参考视角下的协同分割图像,对非参考视角下的协同分割图像进行单应性投影获取重投影协同分割图像,计算重投影协同分割图像与参考视角下的协同分割图像之间的交叉熵损失,进而获取语义一致性误差。具体流程如说明书附图6所示。The collaborative segmentation processing flow includes: inputting the reference view and multi-view image pair into the pre-trained VGG network, and then performing non-negative matrix decomposition to obtain the collaborative segmentation image in the reference perspective and the collaborative segmentation image in the non-reference perspective. The co-segmented images in the perspective are homographed to obtain the re-projected co-segmented images, and the cross-entropy loss between the re-projected co-segmented images and the co-segmented images in the reference perspective is calculated, and then the semantic consistency error is obtained. The specific process is shown in Figure 6 of the description.
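As an illustration of the per-view feature extraction step, the sketch below builds the feature matrix from a truncated ImageNet-pretrained VGG backbone; the exact torchvision API (whether `pretrained=True` or the newer `weights=` argument is accepted) depends on the installed version, and the input views are assumed to be already resized and ImageNet-normalized.

```python
import torch
import torchvision.models as models  # pretrained VGG backbone is an assumption of this sketch

def build_feature_matrix(views):
    """views: float tensor (V, 3, H, W), already resized and ImageNet-normalized.
    Returns the 2-D feature matrix A of shape (V*h*w, C) used for co-segmentation."""
    backbone = models.vgg16(pretrained=True).features[:16].eval()  # truncated, shared across views
    with torch.no_grad():
        feats = backbone(views)                      # (V, C, h, w), non-negative after the ReLU
    V, C, h, w = feats.shape
    A = feats.permute(0, 2, 3, 1).reshape(V * h * w, C)
    return A, (V, h, w, C)
```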
在本实施例中,协同分割处理与步骤S2的深度估计处理类似。将参考视图和多视角图像对输入到预训练的卷积神经网络。特别地,多视角图像对中的每张图像都会被送入一个共享权重的卷积神经网络提取特征,优选地,卷积神经网络选用ImageNet预训练的VGG网络。由此,每个视角的图像都会得到一个对应的特征图张量,特征图张量的维度是H×W×C,其中,H和W为图像的高和宽,C为卷积神经网络中卷积层的通道数。所有视角的特征图张量被展开并凭借到一起构成一个二维矩阵,即特征图矩阵A∈R V×H×W×C,其维度是V×H×W×C,其中V是总视角数。通过链式迭代式对所述特征图矩阵进行非负矩阵分解,求得第一非负矩阵P和第二非负矩阵Q。第一非负矩阵P和所述第二非负矩阵Q的表达式分别为: In this embodiment, the cooperative segmentation process is similar to the depth estimation process of step S2. The reference view and multi-view image pairs are input to a pretrained convolutional neural network. In particular, each image in the multi-view image pair will be fed into a weighted convolutional neural network to extract features, preferably, the convolutional neural network is a VGG network pre-trained by ImageNet. Thus, the image of each view will get a corresponding feature map tensor, and the dimension of the feature map tensor is H×W×C, where H and W are the height and width of the image, and C is the convolutional neural network. The number of channels in the convolutional layer. The feature map tensors of all views are expanded and taken together to form a two-dimensional matrix, that is, the feature map matrix A∈R V×H×W×C , whose dimension is V×H×W×C, where V is the total view number. The feature map matrix is subjected to non-negative matrix decomposition through chain iteration to obtain a first non-negative matrix P and a second non-negative matrix Q. The expressions of the first non-negative matrix P and the second non-negative matrix Q are respectively:
$$P \in \mathbb{R}^{V\times H\times W\times K},\qquad Q \in \mathbb{R}^{C\times K}$$
K表示非负矩阵分解过程中的P矩阵的列数,也是Q矩阵的行数。由于非负矩阵的正交约束假设,要求其中的Q矩阵必须为满足以下条件:QQ T=I,其中,I为单位矩阵。由于正交约束的限制,Q矩阵的每行向量都需要同时包含可能多的A矩阵的信息,且保持尽可能地不重合。换句话说,Q矩阵的每行向量可以近似地看做聚类的簇中心,而非负矩阵分解求解的过程也可以看做聚类的过程。相应地,P矩阵表示的就是所有多视角图像的每个像素针对语义上的聚类簇中心(Q矩阵每行的向量)的相关程度,即分割置信度。由此实现不依靠任何监督信号实现多视角图像的协同分割,提取得到多视角图像的共有语义信息。非负矩阵分解实现协同分割提取共有语义信息示意图如说明书附图所示。 K represents the number of columns of the P matrix in the non-negative matrix decomposition process, and is also the number of rows of the Q matrix. Due to the orthogonal constraint assumption of non-negative matrices, it is required that the Q matrix in it must satisfy the following conditions: QQ T =I, where I is the identity matrix. Due to the constraints of orthogonality, each row vector of the Q matrix needs to contain as much information as possible of the A matrix at the same time, and keep it as disjoint as possible. In other words, each row vector of the Q matrix can be approximately regarded as the cluster center of the cluster, and the process of solving the non-negative matrix decomposition can also be regarded as the process of clustering. Correspondingly, the P matrix represents the correlation degree of each pixel of all multi-view images with respect to the semantic cluster center (the vector of each row of the Q matrix), that is, the segmentation confidence. In this way, the collaborative segmentation of multi-view images is realized without relying on any supervision signal, and the common semantic information of multi-view images is extracted. The schematic diagram of the non-negative matrix decomposition to achieve collaborative segmentation and extraction of common semantic information is shown in the accompanying drawings of the description.
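A minimal sketch of the non-negative matrix factorization used for co-segmentation is given below. It uses the standard multiplicative (Lee-Seung) updates for A ≈ PQᵀ and, for brevity, does not enforce the orthogonality constraint on Q mentioned above.

```python
import numpy as np

def nmf_cosegmentation(A, K, iters=100, eps=1e-8, seed=0):
    """Factor the non-negative feature matrix A (n x C) into P (n x K) and Q (C x K)
    with multiplicative updates so that A is approximately P @ Q.T.
    The K columns of Q act as cluster centres in feature space; P holds the
    per-pixel segmentation confidences that become the co-segmentation map."""
    rng = np.random.default_rng(seed)
    n, C = A.shape
    P = rng.random((n, K)) + eps
    Q = rng.random((C, K)) + eps
    for _ in range(iters):
        P *= (A @ Q) / (P @ (Q.T @ Q) + eps)
        Q *= (A.T @ P) / (Q @ (P.T @ P) + eps)
    err = np.linalg.norm(A - P @ Q.T)
    return P, Q, err
```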
The first non-negative matrix is converted into a format corresponding to the image dimensions to obtain the co-segmentation image. The expression of the co-segmentation image S is:
$$S \in \mathbb{R}^{V\times H\times W\times K}$$
where V is the total number of views, H and W are the height and width of the image, K is the number of columns of the first non-negative matrix P and also the number of rows of the second non-negative matrix Q, and R denotes the real numbers.
需要说明的是,在协同分割分支中,为了兼顾计算量和效率,只是采用了一个较为简单的传统方案进行协同分割任务。但是在协同分割领域其实还存在较多的替代方案,本实施例可以通过其他的聚类算法来做协同分割任务,实现相当的效果。It should be noted that, in the cooperative segmentation branch, in order to take into account the amount of calculation and efficiency, only a relatively simple traditional scheme is adopted to perform the cooperative segmentation task. However, in the field of collaborative segmentation, there are actually many alternative solutions. In this embodiment, other clustering algorithms can be used to perform the collaborative segmentation task to achieve a comparable effect.
In particular, when non-negative matrix factorization is implemented, the solution often fails when processing multi-view images of real scenes because of defects of the method itself. This problem is largely because the iterative solving process is highly dependent on the randomly initialized state values: if a good initial value is not encountered, the solution of the whole non-negative matrix factorization cannot converge, the co-segmentation fails as well, and in the end the whole training process cannot proceed. In this embodiment, the original iterative solving process is extended into a multi-branch parallel solving process: multiple groups of solutions are randomly initialized each time, and the best one is selected and fed into the next iteration. This largely avoids the problem of solution failure caused by poor random initialization values.
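The multi-start strategy described above can be sketched as follows; this simplified variant keeps the best of several complete solves (reusing the `nmf_cosegmentation` helper from the previous sketch), whereas the embodiment selects the best candidate at each iteration.

```python
def robust_nmf(A, K, restarts=4, iters=100):
    """Run several randomly initialized factorizations and keep the one with the
    lowest reconstruction error, so that a single bad initialization cannot break
    the co-segmentation step."""
    best = None
    for seed in range(restarts):
        P, Q, err = nmf_cosegmentation(A, K, iters=iters, seed=seed)
        if best is None or err < best[2]:
            best = (P, Q, err)
    return best
```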
此外,由于语义分割任务的特殊性,往往需要限定特定的场景和可能的语义类别。而本实施例只需要挖掘不同视图中的共有语义部件(聚类簇),不再需要关心特定的场景和语义标签。因此,本实施例提供的方法可以泛化到任意动态变化的场景,而不需要像其他方法一样需要大量繁琐昂贵的语义标注工作。Furthermore, due to the particularity of semantic segmentation tasks, it is often necessary to define specific scenarios and possible semantic categories. However, this embodiment only needs to mine common semantic components (clusters) in different views, and no longer needs to care about specific scenes and semantic labels. Therefore, the method provided in this embodiment can be generalized to any dynamically changing scene without requiring a lot of tedious and expensive semantic annotation work like other methods.
In this embodiment, S312 specifically includes: dividing the V views into view pairs consisting of one reference view and a series of non-reference views. The expressions of the co-segmentation image S_1 under the reference view and the co-segmentation image S_i under a non-reference view are respectively:
$$S_1 \in \mathbb{R}^{H\times W\times K},\qquad S_i \in \mathbb{R}^{H\times W\times K}$$
By default the view with index 1 is the reference view, and a view with index i is defined as a non-reference view, where 2 ≤ i ≤ V. According to the camera intrinsic and extrinsic matrices (K, T), the homography formula gives the correspondence between the pixel at position p_j in the reference view and the pixel at position $\hat{p}_j^{\,i}$ in the source view:
$$\hat{p}_j^{\,i} = K_i\,T_i\,T_1^{-1}\,K_1^{-1}\,\big(D(p_j)\,p_j\big)$$
where p_j is the position of the pixel under the reference view, $\hat{p}_j^{\,i}$ is the position of the pixel under the non-reference view, j denotes the index of a pixel in the image or segmentation map with 1 ≤ j ≤ H×W, and D denotes the depth map predicted by the network.
Then, according to the homography mapping formula and the bilinear interpolation strategy, the co-segmentation image S_i under the non-reference view can be projected onto the reference view to obtain the reprojected co-segmentation image $\hat{S}_i$, whose expression is:
$$\hat{S}_{i}(p_j) = S_i\big(\hat{p}_j^{\,i}\big)$$
By comparing the reprojected co-segmentation image $\hat{S}_i$ with the co-segmentation image under the reference view, the cross-entropy loss can be obtained, in which the pseudo-label f(S_{1,j}) is expressed as:
$$f(S_{1,j}) = \mathrm{onehot}\big(\mathrm{argmax}(S_{1,j})\big)$$
The semantic consistency error is obtained from the cross-entropy loss; the semantic consistency error L_{SC,i} for view i is expressed as:
$$L_{SC,i} = \frac{1}{\sum_{j} M_{i,j}} \sum_{j=1}^{H\times W} M_{i,j}\cdot\Big(-f(S_{1,j})\cdot\log\hat{S}_{i,j}\Big)$$
where M_i denotes the valid region mapped from the non-reference view onto the reference view by homography.
The cross-entropy loss between the reprojected semantic segmentation map and the original semantic segmentation map is computed for every view pair. If the predicted depth map is correct, the semantic segmentation map reconstructed from it should also be as similar as possible to the original semantic segmentation map. The overall semantic consistency loss is computed as follows:
$$L_{SC} = \sum_{i=2}^{V} L_{SC,i} = \sum_{i=2}^{V} \frac{1}{\sum_{j} M_{i,j}} \sum_{j\in\mathbb{N},\,1\le j\le H\times W} M_{i,j}\cdot\Big(-f(S_{1,j})\cdot\log\hat{S}_{i,j}\Big)$$
where f(S_{1,j}) is the cross-entropy pseudo-label term, L_SC is the semantic consistency error, M_i denotes the valid region mapped from the non-reference view onto the reference view by homography, N is the set of natural numbers, j denotes the index of a pixel in the image, H and W are the height and width of the image, $\hat{S}_i$ is the reprojected co-segmentation image, S_1 is the co-segmentation image under the reference view, and i denotes a non-reference view.
在本实施例中,训练时语义一致性损失的权重默认设置为0.1。In this embodiment, the weight of semantic consistency loss during training is set to 0.1 by default.
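For illustration, a sketch of the semantic consistency term and of how it could be weighted into the overall training objective follows; the warped segmentation maps and validity masks are assumed to be given, and the weight of the data-augmentation term is an assumption of the example (only the 0.1 weight of the semantic term is stated above).

```python
import numpy as np

def semantic_consistency_loss(S_ref, S_warp_list, mask_list, eps=1e-8):
    """S_ref: (H, W, K) co-segmentation of the reference view.
    S_warp_list: segmentation maps of non-reference views warped to the reference view.
    mask_list: binary masks of the regions that warp inside the image."""
    K = S_ref.shape[-1]
    onehot = np.eye(K)[S_ref.argmax(axis=-1)]               # one-hot pseudo-label f(S_1)
    total = 0.0
    for S_warp, mask in zip(S_warp_list, mask_list):
        ce = -(onehot * np.log(S_warp + eps)).sum(axis=-1)   # per-pixel cross-entropy
        total += (ce * mask).sum() / (mask.sum() + eps)
    return total

def total_loss(photo_loss, sem_loss, da_loss, w_sem=0.1, w_da=1.0):
    """Weighted objective; 0.1 for the semantic term is stated above, w_da is assumed."""
    return photo_loss + w_sem * sem_loss + w_da * da_loss
```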
由于数据增强操作本身会导致多视角图像的像素值发生变化,因此直接应用数据增强策略可能会破坏自监督信号的亮度一致性假设。不同于有监督方法的真值标签,自监督信号来自于数据本身,更容易受到数据本身的噪声干扰。为了使数据增强策略引入自监督训练框架,将原始的自监督训练分支拓展为双流结构,一个标准分支仅有光度立体视觉自监督信号监督,而另一个分支则引入各种随机数据增强变化。Since the data augmentation operation itself causes the pixel values of multi-view images to change, directly applying the data augmentation strategy may break the luminance consistency assumption of self-supervised signals. Different from the ground-truth labels of supervised methods, self-supervised signals come from the data itself and are more susceptible to noise interference from the data itself. In order to introduce the data augmentation strategy into the self-supervised training framework, the original self-supervised training branch is expanded into a dual-stream structure, one standard branch is only supervised by photometric stereo vision self-supervised signals, while the other branch introduces various random data augmentation changes.
Among them, the data augmentation consistency loss extends the self-supervised branch into a dual-stream structure: the prediction of the standard branch is used as a pseudo-label to supervise the prediction of the data augmentation branch, which disentangles the data-augmentation contrast consistency from the brightness consistency hypothesis so that the two can be handled separately, allowing a large amount of data augmentation to be introduced into the self-supervised signal to enrich the variations in the training set. In contrast, existing self-supervised signals based on photometric stereo consistency are usually limited by the brightness consistency hypothesis and do not allow data augmentation operations, because data augmentation changes the pixel distribution of the image, breaks the brightness consistency hypothesis, and in turn causes brightness consistency ambiguity, making the self-supervised signal unreliable. The specific flow of the data augmentation processing is shown in Figure 7 of the description and includes:
S321、采用不同的数据增强策略对多视角图像对进行数据增强处理;S321, using different data enhancement strategies to perform data enhancement processing on the multi-view image pair;
S322、以深度图像为伪标签对数据增强处理后的多视角损失图像对进行监督,获取不同数据增强策略下的数据损失;S322, using the depth image as a pseudo-label to supervise the multi-view loss image pair after data enhancement processing, and obtain the data loss under different data enhancement strategies;
S323、根据数据损失获取数据增强一致性损失。S323. Obtain the data enhancement consistency loss according to the data loss.
数据增强处理具体流程包括:将参考视图和多视角图像对输入到深度估计网络进行深度估计处理,获取深度图,根据深度图获取有效区域掩码,将有效区域掩码作为伪标签。对参考视图和多视角图像对进行随机数据增强后输入到深度估计网络进行深度估计处理获取对比深度图,计算对比深度图和伪标签之间的差异,进而获取数据增强一致性损失。数据增强处理原理如说明书附图8所示。The specific process of data enhancement processing includes: inputting the reference view and multi-view image pairs to the depth estimation network for depth estimation processing, obtaining a depth map, obtaining an effective area mask according to the depth map, and using the effective area mask as a pseudo-label. After performing random data enhancement on the reference view and multi-view image pairs, they are input to the depth estimation network for depth estimation processing to obtain the contrast depth map, and the difference between the contrast depth map and the pseudo-label is calculated, and then the data enhancement consistency loss is obtained. The principle of data enhancement processing is shown in Figure 8 of the description.
在本实施例中,数据增强策略包括随机遮挡掩码、伽马校正、颜色扰动和随机噪声。原始的多视图为I,而作用在多视角图像对上的数据增强函 数为τ θ,数据增强后的多视图为
Figure PCTCN2021137980-appb-000024
θ表示数据增强过程中与具体操作相关的参数。受限于多视角几何的视角约束,不能改变像素位置的分布,否则可能破坏标定相机之间的对应关系。所采用的数据增强分别为:随机遮挡掩码
Figure PCTCN2021137980-appb-000025
伽马校正
Figure PCTCN2021137980-appb-000026
颜色扰动和随机噪声
Figure PCTCN2021137980-appb-000027
In this embodiment, data augmentation strategies include random occlusion masks, gamma correction, color perturbation, and random noise. The original multi-view is I, and the data enhancement function acting on the multi-view image pair is τ θ , and the multi-view after data enhancement is
Figure PCTCN2021137980-appb-000024
θ represents the parameters related to specific operations in the data augmentation process. Limited by the viewing angle constraints of multi-view geometry, the distribution of pixel positions cannot be changed, otherwise the correspondence between the calibrated cameras may be destroyed. The data enhancements used are: random occlusion masks
Figure PCTCN2021137980-appb-000025
Gamma Correction
Figure PCTCN2021137980-appb-000026
Color perturbation and random noise
Figure PCTCN2021137980-appb-000027
随机遮挡掩码
Figure PCTCN2021137980-appb-000028
为模仿多视角下的前景遮挡情景,可以随机生成一个二进制掩码遮挡掩码
Figure PCTCN2021137980-appb-000029
参考视角下的一部分区域,而
Figure PCTCN2021137980-appb-000030
表示剩下的在预测中有效的区域。而
Figure PCTCN2021137980-appb-000031
所包含的区域对于遮挡变化应当保持不变性,所以整个系统应该在这种人为制造的遮挡边缘上保持不变性,由此便可以引导模型更多地关注遮挡边缘的处理。
random occlusion mask
Figure PCTCN2021137980-appb-000028
In order to simulate the foreground occlusion scenario in multi-view, a binary mask occlusion mask can be randomly generated
Figure PCTCN2021137980-appb-000029
A part of the area under the reference view, and
Figure PCTCN2021137980-appb-000030
Indicates the remaining regions that are valid in prediction. and
Figure PCTCN2021137980-appb-000031
The included area should remain invariant to occlusion changes, so the entire system should remain invariant on such artificial occlusion edges, which can guide the model to pay more attention to the processing of occlusion edges.
伽马校正
Figure PCTCN2021137980-appb-000032
伽马校正是一个常见的被用来调整图像光照的数据增强操作。为了模拟尽可能多且复杂的光照变化情况,引入了随机伽马校正来进行数据增强。
Gamma Correction
Figure PCTCN2021137980-appb-000032
Gamma correction is a common data augmentation operation used to adjust image lighting. To simulate as many and complex lighting variations as possible, random gamma correction is introduced for data augmentation.
Color perturbation and random noise [PCTCN2021137980-appb-000033]: because of the brightness-consistency ambiguity, any color perturbation changes the pixel distribution of the image and undermines the validity of the stereo-vision-based self-supervised loss, so the self-supervised loss struggles to remain robust under color perturbation. The RGB pixel values of the image are therefore randomly perturbed and random Gaussian noise is added, among other operations, to assist data enhancement and to simulate as many perturbation variations as possible.
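By way of illustration only (the patent provides no code), the three families of data enhancement described above can be sketched roughly as follows in PyTorch; the tensor layout, the parameter ranges, and the helper name random_augment are assumptions made for this example, and pixel positions are deliberately left unchanged so that the calibrated multi-view correspondences are preserved.

```python
import torch

def random_augment(views: torch.Tensor):
    """Apply gamma, color/noise, and occlusion augmentation to a stack of views.

    views: float tensor of shape (V, 3, H, W) in [0, 1] (assumed layout).
    Returns the augmented views and the binary non-occluded valid mask (V, 1, H, W).
    Pixel positions are never moved, so multi-view correspondences are preserved.
    """
    V, _, H, W = views.shape
    out = views.clone()

    # 1) Random gamma correction to simulate lighting changes (range is an assumption).
    gamma = torch.empty(V, 1, 1, 1).uniform_(0.7, 1.4)
    out = out.clamp(min=1e-6) ** gamma

    # 2) Color perturbation plus additive Gaussian noise (scales are assumptions).
    color_scale = torch.empty(V, 3, 1, 1).uniform_(0.8, 1.2)
    noise = 0.02 * torch.randn_like(out)
    out = (out * color_scale + noise).clamp(0.0, 1.0)

    # 3) Random occlusion mask: zero out a random rectangle per view.
    valid = torch.ones(V, 1, H, W)
    for v in range(V):
        h, w = H // 4, W // 4                      # assumed occluder size
        top = torch.randint(0, H - h, (1,)).item()
        left = torch.randint(0, W - w, (1,)).item()
        valid[v, :, top:top + h, left:left + w] = 0.0
    out = out * valid

    return out, valid
```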
It should be noted that in the data enhancement branch of this embodiment only these three families of data enhancement strategies are used; all possible combinations of data enhancement strategies are not enumerated in order to piece together an optimal combination. As an alternative, special adaptive data enhancement schemes could be used instead.
In S322, the depth image is used as a pseudo-label to supervise the multi-view image pairs after data enhancement processing, and the data loss under each data enhancement strategy is obtained. A data enhancement strategy needs a relatively reliable reference standard. In supervised training this reference is usually the invariance of random data enhancement with respect to the ground-truth label, but in self-supervised training this assumption does not hold, because no ground-truth label is available. Therefore, this embodiment takes the depth map predicted by the standard self-supervised branch, i.e., the depth map from the depth estimation processing in step S2, as the pseudo ground-truth label, and requires the predictions obtained after random data enhancement to remain as invariant as possible with respect to this pseudo-label. This operation decouples data enhancement from the self-supervised loss without affecting the brightness-consistency assumption on which the self-supervised loss relies.
The data enhancement strategies of step S321 can be combined into a composite data enhancement function [PCTCN2021137980-appb-000034]. Let D be the depth map predicted by the standard self-supervised branch (the depth estimation processing), and let the depth map predicted by the data enhancement branch be denoted by [PCTCN2021137980-appb-000035]. The data enhancement consistency loss L_DA is then computed; its expression is given in [PCTCN2021137980-appb-000036], where [PCTCN2021137980-appb-000037] is the random occlusion mask, [PCTCN2021137980-appb-000038] is the gamma correction, [PCTCN2021137980-appb-000039] is the color perturbation and random noise, [PCTCN2021137980-appb-000040] denotes the binary non-occluded valid-region mask in the random occlusion mask [PCTCN2021137980-appb-000041], [PCTCN2021137980-appb-000042] and [PCTCN2021137980-appb-000043] denote the dot (element-wise) product, and D is the depth map predicted by the network.
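The exact expression of L_DA is given in the original only as a formula image. Purely as an illustrative sketch, a masked difference between the standard-branch depth (used as the pseudo-label) and the depth predicted from the augmented inputs, restricted to the non-occluded valid region, could look as follows; the choice of an L1 difference and the tensor shapes are assumptions.

```python
import torch

def data_enhancement_consistency_loss(depth_std, depth_aug, valid_mask):
    """Masked consistency between the standard-branch depth (pseudo-label) and the
    depth predicted from augmented inputs.

    All three tensors are assumed to share the same shape, e.g. (V, H, W);
    valid_mask is the binary non-occluded valid-region mask. An L1 difference is
    assumed here only for illustration -- the patent gives the formula as an image.
    """
    diff = valid_mask * (depth_std.detach() - depth_aug).abs()
    return diff.sum() / valid_mask.sum().clamp(min=1.0)
```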
During training, a different random data enhancement strategy is applied to each image at every iteration, and the loss L_DA is then computed with the above formula. In addition, because the data enhancement loss presumes that the overall training process has already converged, giving it too large a weight early in training may prevent the self-supervised training from converging. The influence weight of the data enhancement loss is therefore adjusted adaptively according to the training progress: the weight starts at 0.01 and doubles every two epochs. The data enhancement loss only plays a substantial role after the network has converged.
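As a small illustration of the stated schedule (weight 0.01 at the start, doubled every two epochs), a helper of the following form could be used; the upper bound applied below is an assumption, since no cap is stated.

```python
def da_loss_weight(epoch: int, base: float = 0.01, cap: float = 1.0) -> float:
    """Weight of the data enhancement consistency loss: 0.01 at the start,
    doubled every two epochs; the cap is an assumption (no upper bound is stated)."""
    return min(base * (2 ** (epoch // 2)), cap)

# epochs 0-1 -> 0.01, epochs 2-3 -> 0.02, epochs 4-5 -> 0.04, ...
```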
In particular, the whole self-supervised training framework involves many operations, and the data enhancement branch in particular runs the entire network forward twice. If a straightforward parallel forward-backward update strategy is used directly, the GPU memory available during training is insufficient (11 GB by default) and memory overflow occurs. To address this video memory overflow problem, this embodiment trades time for space: the original single pass of forward computation, self-supervised loss, and backpropagation is split into two forward-backward passes. First, the self-supervised loss of the standard branch is computed in a forward pass, the gradients are backpropagated, the cache is cleared, and the depth map predicted by the standard branch is saved as a pseudo-label; then the self-supervised loss of the data enhancement branch is computed in a second forward pass and supervised with that pseudo-label. Because the gradient updates of the different losses are decoupled into separate stages, they do not need to occupy GPU memory at the same time, which greatly reduces the GPU memory footprint.
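A rough sketch of this time-for-space training step is given below; the model and loss interfaces are placeholders, and only the two-pass structure (standard branch first, then the data enhancement branch supervised by the detached pseudo-label) follows the description above. It reuses the random_augment and data_enhancement_consistency_loss helpers sketched earlier.

```python
import torch

def two_stage_step(model, views, cams, optimizer, self_sup_loss, da_weight):
    """One training step with decoupled gradient updates (rough sketch only).

    model, cams, and self_sup_loss are placeholders for the depth network, camera
    parameters, and the standard self-supervised loss (photometric + semantic terms).
    """
    # Pass 1: standard self-supervised branch.
    optimizer.zero_grad()
    depth_std = model(views, cams)
    loss_std = self_sup_loss(depth_std, views, cams)
    loss_std.backward()
    optimizer.step()

    pseudo_label = depth_std.detach()          # saved as the pseudo ground truth
    torch.cuda.empty_cache()                   # release cached activations

    # Pass 2: data enhancement branch supervised by the pseudo-label.
    aug_views, valid_mask = random_augment(views)
    optimizer.zero_grad()
    depth_aug = model(aug_views, cams)
    loss_da = data_enhancement_consistency_loss(pseudo_label, depth_aug,
                                                valid_mask.squeeze(1))
    (da_weight * loss_da).backward()
    optimizer.step()
    return loss_std.item(), loss_da.item()
```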
In this embodiment, S4 constructs the loss function from the photometric consistency loss, the semantic consistency loss, and the data enhancement consistency loss. The loss function L is expressed as
$L = L_{PC} + L_{DA} + L_{SC}$
where $L_{PC}$ is the photometric consistency loss, $L_{DA}$ is the data enhancement consistency loss, and $L_{SC}$ is the semantic consistency loss.
This embodiment replaces traditional stereo matching with dense depth map estimation based on deep learning. A neural network model is constructed and trained according to the loss function, and the trained model is applied in the complete three-dimensional reconstruction pipeline to obtain a three-dimensional model whose quality is comparable to that of methods trained on manually annotated samples. This embodiment thus provides a low-cost alternative for training a high-precision three-dimensional reconstruction model and can be extended to scenarios related to three-dimensional reconstruction such as map surveying, autonomous driving, and AR/VR.
The method proposed in this embodiment was evaluated on the DTU dataset; the experimental results are shown in Figure 9 of the description. The DACS-MS method proposed in this embodiment achieves an average per-point reconstruction error of 0.358 mm on the DTU dataset, far smaller than comparable unsupervised methods such as MVS, MVS², and M³VSNet. Compared with supervised methods, DACS-MS also approaches the most advanced supervised methods in the prior art and surpasses some existing supervised methods. The experimental results show that the self-supervised method proposed in this embodiment outperforms traditional unsupervised three-dimensional reconstruction methods on the DTU dataset and achieves results comparable to state-of-the-art supervised methods. The models reconstructed with the self-supervised three-dimensional reconstruction method based on collaborative segmentation and data enhancement provided by this embodiment are shown in Figure 10 and Figure 11 of the description; the results of this embodiment are shown in the third column of the figures. These specific experimental results show that this embodiment can achieve technical effects equal or close to those of supervised methods, and the reconstructed three-dimensional model meets the technical requirements.
This embodiment provides a self-supervised three-dimensional reconstruction method based on collaborative segmentation and data enhancement. To address the brightness-consistency ambiguity problem, abstract semantic cues are introduced and a data enhancement mechanism is embedded in the self-supervised signal, which improves the reliability of the self-supervised signal under noise and perturbation. The proposed self-supervised training method surpasses traditional unsupervised methods and achieves results comparable to some leading supervised methods. The semantic consistency loss based on collaborative segmentation dynamically mines shared semantic components from multi-view pairs through clustering. The data enhancement consistency loss extends the self-supervised branch into a two-stream structure: the predictions of the standard branch are used as pseudo-labels to supervise the predictions of the data enhancement branch, so that the data-enhancement contrastive-consistency assumption and the brightness-consistency assumption are disentangled and handled separately, allowing a large amount of data enhancement to be introduced into the self-supervised signal to enrich the variations in the training set. The whole pipeline requires no labeled data and does not rely on ground-truth annotation; instead, it mines effective information from the data itself to train the network, which greatly reduces cost and shortens the reconstruction process. By fusing depth prediction, collaborative segmentation, and data enhancement, and on the basis of solving the GPU memory overflow problem, the accuracy of the self-supervised signal is improved, giving this embodiment better generalization.
Embodiment 2
On the basis of Embodiment 1, this embodiment modularizes the self-supervised three-dimensional reconstruction method based on collaborative segmentation and data enhancement proposed in Embodiment 1 into a self-supervised three-dimensional reconstruction system based on collaborative segmentation and data enhancement. A schematic diagram of the modules is shown in Figure 12 of the description, and the complete system structure is shown in Figure 13 of the description.
A self-supervised three-dimensional reconstruction system based on collaborative segmentation and data enhancement comprises an input unit 1, a depth processing unit 2, a dual-branch processing unit 3, a loss function construction unit 4, and an output unit 5, connected in sequence.
The input unit 1 is used to acquire input data and, from the input data, obtain multi-view image pairs that have overlapping regions and similar viewing angles. The input unit comprises an input data acquisition unit 11, a conversion unit 12, a screening unit 13, and a preprocessing unit 14.
The depth processing unit 2 is used to obtain the photometric consistency loss by performing depth estimation processing on the multi-view image pairs. The depth processing unit comprises a depth image acquisition unit 21, a regression loss acquisition unit 22, and a photometric loss acquisition unit 23.
The dual-branch processing unit 3 comprises a collaborative segmentation unit 31 and a data enhancement unit 32 that run in parallel. The collaborative segmentation unit 31 is used to obtain the semantic consistency loss by performing collaborative segmentation processing on the multi-view image pairs; the data enhancement unit 32 is used to obtain the data enhancement consistency loss by performing data enhancement processing on the multi-view image pairs.
The loss function construction unit 4 is used to construct the loss function from the photometric consistency loss, the semantic consistency loss, and the data enhancement consistency loss.
The output unit 5 is used to construct and train a neural network model according to the loss function and to obtain a three-dimensional model of the input data based on the neural network model.
The depth processing unit 2 comprises a depth image acquisition unit 21, a regression loss acquisition unit 22, and a photometric loss acquisition unit 23. Its basic principle is as follows: the multi-view image pairs and the reference view are input to the depth estimation network for depth estimation to obtain a depth map; the depth map and the multi-view image pairs are mapped by homography, and the depth images on the non-reference views are reconstructed to obtain reprojected view images; the regression loss is obtained by computing the difference between the reprojected view and the reference view, and the photometric consistency error is obtained from the regression loss. The specific structure is:
The depth image acquisition unit 21 is used to perform depth estimation on the multi-view images through the depth estimation network to obtain a depth image.
The regression loss acquisition unit 22 is used to obtain the reference view and the non-reference views, reconstruct the depth images on the non-reference views through homography mapping to obtain reprojected view images, and compute the regression loss from the reprojected view images.
The photometric loss acquisition unit 23 is used to obtain the photometric consistency loss from the regression loss.
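As an illustrative sketch of the reprojection underlying this regression/photometric loss (not the patented implementation), a source view can be warped onto the reference view with the predicted depth and bilinear sampling as follows; the single-pair interface, the L1 photometric term, and the variable names are assumptions.

```python
import torch
import torch.nn.functional as F

def warp_src_to_ref(src_img, depth_ref, K_ref, K_src, R, t):
    """Reproject a source view onto the reference view using the predicted depth.

    src_img:   (1, 3, H, W) source image.
    depth_ref: (1, 1, H, W) depth predicted for the reference view.
    K_ref, K_src: (3, 3) intrinsics; R: (3, 3), t: (3, 1) relative pose ref -> src.
    Returns the reprojected image and a validity mask (pixels landing inside src).
    """
    _, _, H, W = depth_ref.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).reshape(3, -1)      # (3, H*W)

    cam = K_ref.inverse() @ pix * depth_ref.reshape(1, -1)       # back-project
    proj = K_src @ (R @ cam + t)                                 # into source view
    xy = proj[:2] / proj[2:].clamp(min=1e-6)

    # Normalize to [-1, 1] for differentiable bilinear sampling.
    gx = 2.0 * xy[0] / (W - 1) - 1.0
    gy = 2.0 * xy[1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).reshape(1, H, W, 2)

    warped = F.grid_sample(src_img, grid, align_corners=True)
    valid = ((gx.abs() <= 1) & (gy.abs() <= 1)).float().reshape(1, 1, H, W)
    return warped, valid

def photometric_consistency_loss(ref_img, src_img, depth_ref, K_ref, K_src, R, t):
    # L1 difference between the reference image and the reprojected source image,
    # averaged over pixels that project inside the source view (assumed form).
    warped, valid = warp_src_to_ref(src_img, depth_ref, K_ref, K_src, R, t)
    return (valid * (warped - ref_img).abs()).sum() / valid.sum().clamp(min=1.0)
```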
The collaborative segmentation unit 31 comprises a segmented image acquisition unit 311, a cross-entropy loss acquisition unit 312, and a semantic loss acquisition unit 313. Its basic principle is as follows: the reference view and the multi-view image pairs are input to a pre-trained VGG network, and non-negative matrix factorization is then performed to obtain the co-segmented image under the reference view and the co-segmented images under the non-reference views; the co-segmented images under the non-reference views are projected by homography to obtain reprojected co-segmented images, the cross-entropy loss between the reprojected co-segmented images and the co-segmented image under the reference view is computed, and the semantic consistency error is then obtained. The specific structure is:
The segmented image acquisition unit 311 is used to perform collaborative segmentation on the multi-view image pairs through a non-negative matrix to obtain co-segmented images.
The cross-entropy loss acquisition unit 312 is used to obtain the reference view and the non-reference views, reconstruct the co-segmented images on the non-reference views through homography mapping to obtain reprojected co-segmented images, and compute the cross-entropy loss between the reprojected co-segmented images and the co-segmented image on the reference view.
The semantic loss acquisition unit 313 is used to obtain the semantic consistency loss from the cross-entropy loss.
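A minimal sketch of co-segmentation by non-negative matrix factorization of shared deep features is given below for illustration; the VGG layer used, the number of components K, the multiplicative-update NMF, and the iteration count are all assumptions rather than the patented procedure.

```python
import torch
import torchvision

def co_segment(views: torch.Tensor, K: int = 4, iters: int = 50):
    """Co-segment a set of views by factorizing their shared deep features.

    views: (V, 3, H, W) image tensor. Returns (V, h, w, K) soft segmentation maps.
    """
    vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
    with torch.no_grad():
        feats = vgg(views)                      # (V, C, h, w), non-negative after ReLU
    V, C, h, w = feats.shape
    A = feats.permute(0, 2, 3, 1).reshape(V * h * w, C)   # stack all views

    # Multiplicative-update NMF: A ~= P @ Q.T with P >= 0, Q >= 0.
    P = torch.rand(V * h * w, K) + 1e-3
    Q = torch.rand(C, K) + 1e-3
    eps = 1e-8
    for _ in range(iters):
        P = P * (A @ Q) / (P @ (Q.T @ Q) + eps)
        Q = Q * (A.T @ P) / (Q @ (P.T @ P) + eps)

    S = P.reshape(V, h, w, K)                   # per-view co-segmentation maps
    return S / (S.sum(dim=-1, keepdim=True) + eps)
```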
The data enhancement unit 32 comprises a data processing unit 321, a data loss acquisition unit 322, and a data consistency loss acquisition unit 323. Its basic principle is as follows: the reference view and the multi-view image pairs are input to the depth processing unit for depth estimation to obtain a depth map; an effective-region mask is obtained from the depth map and used as a pseudo-label. The reference view and the multi-view image pairs are randomly augmented and then input to the depth estimation network for depth estimation to obtain a comparison depth map; the difference between the comparison depth map and the pseudo-label is computed, from which the data enhancement consistency loss is obtained. The specific structure is:
The data processing unit 321 is used to perform data enhancement processing on the multi-view image pairs with different data enhancement strategies; the data processing unit is provided with a depth estimation network.
The data loss acquisition unit 322 is used to supervise the multi-view image pairs after data enhancement processing, with the depth image as a pseudo-label, and to obtain the data loss under each data enhancement strategy.
The data consistency loss acquisition unit 323 is used to obtain the data enhancement consistency loss from the data loss.
The input unit 1 comprises an input data acquisition unit 11, a conversion unit 12, a screening unit 13, and a preprocessing unit 14. The specific structure is:
The input data acquisition unit 11 is used to acquire input data, the input data comprising images or video.
The conversion unit 12 is used to determine whether the input data are images: if so, multi-view images are selected from the input data; if not, the input data are converted into multi-view images.
The screening unit 13 is used to obtain, from the multi-view images, multi-view image pairs that have similar viewing angles and share the same region.
The preprocessing unit 14 is used to perform image preprocessing on the multi-view image pairs.
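For illustration, screening view pairs by two-dimensional scale-invariant feature matching (as in the method's image pair acquisition) could be sketched as follows with OpenCV; the ratio-test threshold, the raw match count used as an overlap score, and top_k are assumptions.

```python
import cv2
import numpy as np

def select_view_pairs(images, top_k: int = 2):
    """Rank candidate view pairs by SIFT match count as a proxy for view overlap.

    images: list of H x W x 3 uint8 BGR arrays. Returns, for each view, the indices
    of its top_k most overlapping other views.
    """
    sift = cv2.SIFT_create()
    desc = [sift.detectAndCompute(cv2.cvtColor(im, cv2.COLOR_BGR2GRAY), None)[1]
            for im in images]
    matcher = cv2.BFMatcher(cv2.NORM_L2)

    n = len(images)
    score = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if desc[i] is None or desc[j] is None:
                continue
            matches = matcher.knnMatch(desc[i], desc[j], k=2)
            good = [p[0] for p in matches
                    if len(p) == 2 and p[0].distance < 0.7 * p[1].distance]
            score[i, j] = score[j, i] = len(good)

    return [list(np.argsort(-score[i])[:top_k]) for i in range(n)]
```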
On the basis of Embodiment 1, this embodiment modularizes the method of Embodiment 1 into a concrete deep-learning-based self-supervised three-dimensional reconstruction system, making it more practical.
Addressing the prior art, the present invention provides a self-supervised three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement. To address the brightness-consistency ambiguity problem, abstract semantic cues are introduced and a data enhancement mechanism is embedded in the self-supervised signal, which improves the reliability of the self-supervised signal under noise and perturbation. The proposed self-supervised training method surpasses traditional unsupervised methods and achieves results comparable to some leading supervised methods. The semantic consistency loss based on collaborative segmentation dynamically mines shared semantic components from multi-view pairs through clustering. The data enhancement consistency loss extends the self-supervised branch into a two-stream structure: the predictions of the standard branch serve as pseudo-labels to supervise the predictions of the data enhancement branch, so that the data-enhancement contrastive-consistency assumption and the brightness-consistency assumption are disentangled and handled separately, allowing a large amount of data enhancement to be introduced into the self-supervised signal to enrich the variations in the training set. The whole pipeline requires no labeled data and does not rely on ground-truth annotation; instead, it mines effective information from the data itself to train the network, which greatly reduces cost and shortens the reconstruction process. By fusing depth prediction, collaborative segmentation, and data enhancement, and on the basis of solving the GPU memory overflow problem, the accuracy of the self-supervised signal is improved, giving the invention better generalization. The method is further modularized into a concrete system, making it more practical.
Those of ordinary skill in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device, or they can be fabricated as individual integrated-circuit modules, or several of the modules or steps can be fabricated as a single integrated-circuit module. As such, the present invention is not limited to any specific combination of hardware and software.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described herein, and that various obvious changes, readjustments, and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in detail through the above embodiments, it is not limited to them and may include further equivalent embodiments without departing from the concept of the present invention; the scope of the present invention is determined by the scope of the appended claims.
The above discloses only a few specific implementation scenarios of the present invention; however, the present invention is not limited thereto, and any variation conceivable by those skilled in the art shall fall within the protection scope of the present invention.

Claims (18)

  1. A self-supervised three-dimensional reconstruction method based on collaborative segmentation and data enhancement, characterized by comprising:
    image pair acquisition: acquiring input data, and acquiring, from the input data, multi-view image pairs that have overlapping regions and similar viewing angles;
    depth estimation processing: obtaining a photometric consistency loss by performing depth estimation processing on the multi-view image pairs;
    collaborative segmentation processing: obtaining a semantic consistency loss by performing collaborative segmentation processing on the multi-view image pairs;
    data enhancement processing: obtaining a data enhancement consistency loss by performing data enhancement processing on the multi-view image pairs;
    constructing a loss function: constructing a loss function from the photometric consistency loss, the semantic consistency loss, and the data enhancement consistency loss;
    model output: constructing and training a neural network model according to the loss function, and obtaining, based on the neural network model, a three-dimensional model corresponding to the input data.
  2. The method according to claim 1, characterized in that the collaborative segmentation processing specifically comprises:
    co-segmented image acquisition: performing collaborative segmentation on the multi-view image pairs through a non-negative matrix to obtain co-segmented images;
    cross-entropy loss acquisition: obtaining a reference view and non-reference views, reconstructing the co-segmented images on the non-reference views to obtain reprojected co-segmented images, and computing the cross-entropy loss between the reprojected co-segmented images and the co-segmented image on the reference view;
    semantic consistency loss acquisition: obtaining the semantic consistency loss from the cross-entropy loss.
  3. The method according to claim 1 or 2, characterized in that the depth estimation processing specifically comprises:
    performing depth estimation on the multi-view images based on a depth estimation network to obtain a depth image;
    obtaining a reference view and non-reference views, reconstructing the depth images on the non-reference views to obtain reprojected view images, and computing a regression loss from the reprojected view images;
    obtaining the photometric consistency loss from the regression loss.
  4. The method according to claim 3, characterized in that the data enhancement processing specifically comprises:
    performing data enhancement on the multi-view image pairs with different data enhancement strategies;
    supervising the data-enhanced multi-view image pairs with the depth image as a pseudo-label, and obtaining the data loss under each of the data enhancement strategies;
    obtaining the data enhancement consistency loss from the data loss.
  5. The method according to claim 1, characterized in that the image pair acquisition specifically comprises:
    acquiring input data, the input data comprising images or video;
    determining whether the input data are images: if so, selecting multi-view images from the input data; if not, converting the input data into multi-view images;
    obtaining, from the multi-view images, multi-view image pairs that have similar viewing angles and share the same region;
    performing image preprocessing on the multi-view image pairs.
  6. The method according to claim 5, characterized in that the step of obtaining, from the multi-view images, multi-view image pairs that have similar viewing angles and share the same region further comprises:
    performing feature matching on the multi-view images with two-dimensional scale-invariant image features to obtain the matching degree of the image features;
    computing the degree of view overlap between images from the matching degree, and sorting the degrees of view overlap to obtain multi-view image pairs that have similar viewing angles and share the same region.
  7. The method according to claim 2, characterized in that the co-segmented image acquisition specifically comprises:
    performing feature extraction on each image in the multi-view image pairs through a convolutional neural network to obtain a feature map tensor for each view, the feature map tensors of all views forming a feature map matrix;
    performing non-negative matrix factorization on the feature map matrix through chained iterations to obtain a first non-negative matrix and a second non-negative matrix;
    converting the first non-negative matrix into a format corresponding to the image dimensions to obtain the co-segmented images.
  8. The method according to claim 7, characterized in that the feature map matrix is expressed as
    $A \in \mathbb{R}^{V \times H \times W \times C}$;
    the first non-negative matrix and the second non-negative matrix are expressed as
    $P \in \mathbb{R}^{V \times H \times W \times K}$, $Q \in \mathbb{R}^{C \times K}$;
    and the co-segmented images are expressed as
    $S \in \mathbb{R}^{V \times H \times W \times K}$;
    where A is the feature map matrix, S is the co-segmented image, P is the first non-negative matrix, Q is the second non-negative matrix, V is the total number of views, H and W are the height and width of the images, C is the number of channels of the convolutional layer in the convolutional neural network, K is the number of columns of the first non-negative matrix P in the non-negative matrix factorization and is also the number of rows of the second non-negative matrix Q, and R denotes the set of real numbers.
  9. The method according to claim 2 or 7, characterized in that the cross-entropy loss acquisition specifically comprises:
    selecting one reference view among all views, the views other than the reference view being non-reference views, and obtaining the co-segmented image under the reference view and the co-segmented images under the non-reference views;
    computing, according to the homography formula, the correspondence of a pixel at the same position between the reference view and the non-reference views;
    projecting the co-segmented images under the non-reference views onto the reference view for reconstruction, based on the homography mapping formula and a bilinear interpolation strategy, to obtain reprojected co-segmented images;
    computing the cross-entropy loss between the reprojected co-segmented images and the co-segmented image under the reference view.
  10. The method according to claim 9, characterized in that the co-segmented image under the reference view and the co-segmented images under the non-reference views are expressed as
    $S_1 \in \mathbb{R}^{H \times W \times K}$, $S_i \in \mathbb{R}^{H \times W \times K}$,
    where $S_1$ is the co-segmented image under the reference view, $S_i$ is the co-segmented image under a non-reference view, V is the total number of views, H and W are the height and width of the images, K is the number of columns of the first non-negative matrix P and also the number of rows of the second non-negative matrix Q, and i is a non-reference view with 2 ≤ i ≤ V;
    the correspondence is given by the expression shown in [PCTCN2021137980-appb-100001];
    the reprojected co-segmented image, denoted by the symbol shown in [PCTCN2021137980-appb-100002], is given by the expression shown in [PCTCN2021137980-appb-100003];
    where $p_j$ is the position of a pixel under the reference view, [PCTCN2021137980-appb-100004] is the position of the pixel under the non-reference view, j is the index of the pixel in the image, D is the depth map predicted by the network, and [PCTCN2021137980-appb-100005] is the reprojected co-segmented image.
  11. The method according to claim 10, characterized in that the cross-entropy loss is expressed as
    $f(S_{1,j}) = \mathrm{onehot}(\mathrm{argmax}(S_{1,j}))$;
    the semantic consistency error is expressed by the formula shown in [PCTCN2021137980-appb-100006];
    where $f(S_{1,j})$ is the cross-entropy loss, $L_{SC}$ is the semantic consistency error, $M_i$ denotes the valid region mapped from the non-reference view to the reference view by homography projection, N is the set of natural numbers, i is a non-reference view, j is the index of a pixel in the image, H and W are the height and width of the images, [PCTCN2021137980-appb-100007] is the reprojected co-segmented image, and $S_1$ is the co-segmented image under the reference view.
  12. The method according to claim 4, characterized in that the data enhancement strategies include random occlusion masks, gamma correction, color perturbation, and random noise.
  13. The method according to claim 12, characterized in that the data enhancement consistency loss is expressed by the formula shown in [PCTCN2021137980-appb-100008];
    where $L_{DA}$ is the data enhancement consistency loss, the data enhancement function is shown in [PCTCN2021137980-appb-100009], [PCTCN2021137980-appb-100010] is the random occlusion mask, [PCTCN2021137980-appb-100011] is the gamma correction, [PCTCN2021137980-appb-100012] is the color perturbation and random noise, [PCTCN2021137980-appb-100013] denotes the binary non-occluded valid-region mask in the random occlusion mask [PCTCN2021137980-appb-100014], and D is the depth map.
  14. A self-supervised three-dimensional reconstruction system based on collaborative segmentation and data enhancement, characterized by comprising:
    an input unit, configured to acquire input data and to acquire, from the input data, multi-view image pairs that have overlapping regions and similar viewing angles;
    a depth processing unit, configured to obtain a photometric consistency loss by performing depth estimation processing on the multi-view image pairs;
    a dual-branch processing unit comprising a collaborative segmentation unit and a data enhancement unit that run in parallel, the collaborative segmentation unit being configured to obtain a semantic consistency loss by performing collaborative segmentation processing on the multi-view image pairs, and the data enhancement unit being configured to obtain a data enhancement consistency loss by performing data enhancement processing on the multi-view image pairs;
    a loss function construction unit, configured to construct a loss function from the photometric consistency loss, the semantic consistency loss, and the data enhancement consistency loss;
    an output unit, configured to construct and train a neural network model according to the loss function and to obtain a three-dimensional model of the input data based on the neural network model.
  15. The system according to claim 14, characterized in that the input unit comprises:
    an input data acquisition unit, configured to acquire input data, the input data comprising images or video;
    a conversion unit, configured to determine whether the input data are images: if so, selecting multi-view images from the input data; if not, converting the input data into multi-view images;
    a screening unit, configured to obtain, from the multi-view images, multi-view image pairs that have similar viewing angles and share the same region;
    a preprocessing unit, configured to perform image preprocessing on the multi-view image pairs.
  16. The system according to claim 14 or 15, characterized in that the collaborative segmentation unit comprises:
    a segmented image acquisition unit, configured to perform collaborative segmentation on the multi-view image pairs through a non-negative matrix to obtain co-segmented images;
    a cross-entropy loss acquisition unit, configured to obtain a reference view and non-reference views, reconstruct the co-segmented images on the non-reference views through homography mapping to obtain reprojected co-segmented images, and compute the cross-entropy loss between the reprojected co-segmented images and the co-segmented image on the reference view;
    a semantic loss acquisition unit, configured to obtain the semantic consistency loss from the cross-entropy loss.
  17. The system according to claim 16, characterized in that the depth processing unit comprises:
    a depth image acquisition unit, configured to perform depth estimation on the multi-view images based on a depth estimation network to obtain a depth image;
    a regression loss acquisition unit, configured to obtain a reference view and non-reference views, reconstruct the depth images on the non-reference views through homography mapping to obtain reprojected view images, and compute a regression loss from the reprojected view images;
    a photometric loss acquisition unit, configured to obtain the photometric consistency loss from the regression loss.
  18. The system according to claim 17, characterized in that the data enhancement unit comprises:
    a data processing unit, configured to perform data enhancement processing on the multi-view image pairs with different data enhancement strategies;
    a data loss acquisition unit, configured to supervise the multi-view image pairs processed by the data processing unit, with the depth image as a pseudo-label, and to obtain the data loss under each of the data enhancement strategies;
    a data consistency loss acquisition unit, configured to obtain the data enhancement consistency loss from the data loss.
PCT/CN2021/137980 2021-02-05 2021-12-14 Self-supervised three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement WO2022166412A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110162782.9A CN112767468B (en) 2021-02-05 2021-02-05 Self-supervision three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement
CN202110162782.9 2021-02-05

Publications (1)

Publication Number Publication Date
WO2022166412A1 true WO2022166412A1 (en) 2022-08-11

Family

ID=75705190

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/137980 WO2022166412A1 (en) 2021-02-05 2021-12-14 Self-supervised three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement

Country Status (2)

Country Link
CN (1) CN112767468B (en)
WO (1) WO2022166412A1 (en)


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767468B (en) * 2021-02-05 2023-11-03 中国科学院深圳先进技术研究院 Self-supervision three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement
CN113379767B (en) * 2021-06-18 2022-07-08 中国科学院深圳先进技术研究院 Method for constructing semantic disturbance reconstruction network for self-supervision point cloud learning
CN113592913B (en) * 2021-08-09 2023-12-26 中国科学院深圳先进技术研究院 Method for eliminating uncertainty of self-supervision three-dimensional reconstruction
WO2023015414A1 (en) * 2021-08-09 2023-02-16 中国科学院深圳先进技术研究院 Method for eliminating uncertainty in self-supervised three-dimensional reconstruction
CN115082628B (en) * 2022-07-27 2022-11-15 浙江大学 Dynamic drawing method and device based on implicit optical transfer function
CN115222790B (en) * 2022-08-11 2022-12-30 中国科学技术大学 Single photon three-dimensional reconstruction method, system, equipment and storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110503680B (en) * 2019-08-29 2023-08-18 大连海事大学 Unsupervised convolutional neural network-based monocular scene depth estimation method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130107006A1 (en) * 2011-10-28 2013-05-02 New York University Constructing a 3-dimensional image from a 2-dimensional image and compressing a 3-dimensional image to a 2-dimensional image
CN109191515A (en) * 2018-07-25 2019-01-11 北京市商汤科技开发有限公司 A kind of image parallactic estimation method and device, storage medium
CN109712228A (en) * 2018-11-19 2019-05-03 中国科学院深圳先进技术研究院 Establish method, apparatus, electronic equipment and the storage medium of Three-dimension Reconstruction Model
CN110246212A (en) * 2019-05-05 2019-09-17 上海工程技术大学 A kind of target three-dimensional rebuilding method based on self-supervisory study
CN112767468A (en) * 2021-02-05 2021-05-07 中国科学院深圳先进技术研究院 Self-supervision three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115965758A (en) * 2022-12-28 2023-04-14 无锡东如科技有限公司 Three-dimensional reconstruction method for image cooperation monocular instance
CN115862149B (en) * 2022-12-30 2024-03-22 广州紫为云科技有限公司 Method and system for generating 3D human skeleton key point data set
CN115862149A (en) * 2022-12-30 2023-03-28 广州紫为云科技有限公司 Method and system for generating 3D human skeleton key point data set
CN115860091A (en) * 2023-02-15 2023-03-28 武汉图科智能科技有限公司 Depth feature descriptor learning method based on orthogonal constraint
CN117152168B (en) * 2023-10-31 2024-02-09 山东科技大学 Medical image segmentation method based on frequency band decomposition and deep learning
CN117152168A (en) * 2023-10-31 2023-12-01 山东科技大学 Medical image segmentation method based on frequency band decomposition and deep learning
CN117333758B (en) * 2023-12-01 2024-02-13 博创联动科技股份有限公司 Land route identification system based on big data analysis
CN117333758A (en) * 2023-12-01 2024-01-02 博创联动科技股份有限公司 Land route identification system based on big data analysis
CN117635679A (en) * 2023-12-05 2024-03-01 之江实验室 Curved surface efficient reconstruction method and device based on pre-training diffusion probability model
CN117635679B (en) * 2023-12-05 2024-05-28 之江实验室 Curved surface efficient reconstruction method and device based on pre-training diffusion probability model
CN117437363A (en) * 2023-12-20 2024-01-23 安徽大学 Large-scale multi-view stereoscopic method based on depth perception iterator
CN117437363B (en) * 2023-12-20 2024-03-22 安徽大学 Large-scale multi-view stereoscopic method based on depth perception iterator
CN117541662A (en) * 2024-01-10 2024-02-09 中国科学院长春光学精密机械与物理研究所 Method for calibrating camera internal parameters and deriving camera coordinate system simultaneously
CN117541662B (en) * 2024-01-10 2024-04-09 中国科学院长春光学精密机械与物理研究所 Method for calibrating camera internal parameters and deriving camera coordinate system simultaneously
CN117611601A (en) * 2024-01-24 2024-02-27 中国海洋大学 Text-assisted semi-supervised 3D medical image segmentation method
CN117611601B (en) * 2024-01-24 2024-04-23 中国海洋大学 Text-assisted semi-supervised 3D medical image segmentation method

Also Published As

Publication number Publication date
CN112767468B (en) 2023-11-03
CN112767468A (en) 2021-05-07

Similar Documents

Publication Publication Date Title
WO2022166412A1 (en) Self-supervised three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement
Lee et al. From big to small: Multi-scale local planar guidance for monocular depth estimation
Oh et al. Fast video object segmentation by reference-guided mask propagation
Sankaranarayanan et al. Learning from synthetic data: Addressing domain shift for semantic segmentation
CN109859190B (en) Target area detection method based on deep learning
Qi et al. SGUIE-Net: Semantic attention guided underwater image enhancement with multi-scale perception
Zhou et al. Cross-view enhancement network for underwater images
CN108764250B (en) Method for extracting essential image by using convolutional neural network
CN110363068B (en) High-resolution pedestrian image generation method based on multiscale circulation generation type countermeasure network
Batsos et al. Recresnet: A recurrent residual cnn architecture for disparity map enhancement
CN107992874A (en) Image well-marked target method for extracting region and system based on iteration rarefaction representation
US20220343525A1 (en) Joint depth prediction from dual-cameras and dual-pixels
WO2019136591A1 (en) Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network
Laga A survey on deep learning architectures for image-based depth reconstruction
Zhou et al. FSAD-Net: Feedback spatial attention dehazing network
Suin et al. Degradation aware approach to image restoration using knowledge distillation
CN111582437B (en) Construction method of parallax regression depth neural network
CN116391206A (en) Stereoscopic performance capture with neural rendering
US11875490B2 (en) Method and apparatus for stitching images
CN115082966B (en) Pedestrian re-recognition model training method, pedestrian re-recognition method, device and equipment
Greene et al. MultiViewStereoNet: Fast Multi-View Stereo Depth Estimation using Incremental Viewpoint-Compensated Feature Extraction
CN113537359A (en) Training data generation method and device, computer readable medium and electronic equipment
Zhu et al. DANet: dynamic salient object detection networks leveraging auxiliary information
Lin et al. E2PNet: Event to Point Cloud Registration with Spatio-Temporal Representation Learning
Xiong et al. Monocular depth estimation using self-supervised learning with more effective geometric constraints

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21924405

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21924405

Country of ref document: EP

Kind code of ref document: A1