CN115115691A - Monocular three-dimensional plane recovery method, equipment and storage medium - Google Patents

Monocular three-dimensional plane recovery method, equipment and storage medium Download PDF

Info

Publication number
CN115115691A
Authority
CN
China
Prior art keywords
feature
plane
inputting
internal
monocular
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210739676.7A
Other languages
Chinese (zh)
Inventor
常青玲
崔岩
任飞
徐世廷
杨鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Germany Zhuhai Artificial Intelligence Institute Co ltd
Wuyi University
Original Assignee
China Germany Zhuhai Artificial Intelligence Institute Co ltd
Wuyi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Germany Zhuhai Artificial Intelligence Institute Co ltd, Wuyi University filed Critical China Germany Zhuhai Artificial Intelligence Institute Co ltd
Priority to CN202210739676.7A priority Critical patent/CN115115691A/en
Priority to PCT/CN2022/110039 priority patent/WO2024000728A1/en
Publication of CN115115691A publication Critical patent/CN115115691A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G06T 7/55 Depth or shape recovery from multiple images
    • G06T 7/593 Depth or shape recovery from multiple images from stereo images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a monocular three-dimensional plane recovery method, equipment and a storage medium, wherein the method comprises the following steps: performing multi-scale feature extraction on an input image to obtain a first feature map and a second feature map at two scales; inputting the first feature map into a first inner encoder and a first outer encoder, and respectively extracting first internal features of the first image blocks in the first feature map and first associated features among the first image blocks; fusing the first internal features and the first associated features and inputting the result into a first decoder for decoding to obtain predicted plane parameters and a predicted plane region; and performing three-dimensional recovery according to the plane parameters and the plane region to obtain a predicted three-dimensional plane. By separately extracting the internal features of the image blocks in each feature map and the associated features among the image blocks, then fusing these features and inputting them into a decoder for decoding, the invention effectively improves the comprehensiveness of feature extraction and the accuracy of monocular three-dimensional plane recovery.

Description

Monocular three-dimensional plane recovery method, equipment and storage medium
Technical Field
The present invention relates to the field of image data processing, and in particular, to a method, an apparatus, and a storage medium for recovering a monocular three-dimensional plane.
Background
Three-dimensional plane recovery requires segmenting the planar regions of a scene in the image domain while estimating the plane parameters of each corresponding region; three-dimensional plane recovery can then be achieved from the planar regions and plane parameters, and a predicted three-dimensional plane is obtained through reconstruction.
In the related art, monocular three-dimensional plane restoration emphasizes reconstruction precision: the accuracy of the model structure is enhanced by analyzing the edges of planar structures and the embeddings of the scene. However, such methods lack the ability to identify tiny planar regions, so small pixel regions are easily lost during plane detection, which degrades the precision of monocular three-dimensional plane restoration.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art. Therefore, the invention provides a monocular three-dimensional plane restoration method, equipment and a storage medium, which can extract the internal features of the feature maps, effectively improving the comprehensiveness of feature extraction and thereby the precision of monocular three-dimensional plane restoration.
An embodiment of a first aspect of the present invention provides a monocular three-dimensional plane restoration method, including:
performing multi-scale feature extraction on an input image to obtain a first feature map and a second feature map under two scales;
inputting the first feature map into a first inner encoder and a first outer encoder respectively, and extracting first internal features of the first image blocks in the first feature map and first associated features among the first image blocks respectively;
inputting the second feature map into a second inner encoder and a second outer encoder respectively, and extracting second internal features of the second image blocks in the second feature map and second associated features among the second image blocks respectively;
fusing the first internal feature and the first associated feature, and inputting the fused first internal feature and the fused first associated feature into a first decoder for decoding to obtain a prediction plane parameter and a prediction plane area;
fusing the second internal features and the second associated features, and inputting the fused second internal features and the second associated features into a second decoder for decoding to obtain a predicted non-planar area, wherein the predicted non-planar area is used for verifying the predicted planar area;
and performing three-dimensional recovery according to the plane parameters and the plane area to obtain a predicted three-dimensional plane.
The above embodiments of the present invention provide at least the following advantages: by arranging an inner encoder and an outer encoder, the internal features of the image blocks in the corresponding feature map and the associated features between the image blocks are extracted separately, then fused and input into the decoder for decoding; this effectively improves the comprehensiveness of feature extraction, reduces the probability of losing image information, and thereby improves the precision of monocular three-dimensional plane restoration.
According to some embodiments of the first aspect of the present invention, performing multi-scale feature extraction on an input image to obtain a first feature map and a second feature map at two scales includes:
performing multi-scale feature extraction on an input image to obtain a first extraction image and a second extraction image under two scales;
and embedding corresponding position information into the first extraction graph and the second extraction graph respectively to obtain a first characteristic graph and a second characteristic graph under two scales.
According to some embodiments of the first aspect of the present invention, inputting the first feature map into a first inner encoder and a first outer encoder respectively, and extracting a first internal feature of the first image block in the first feature map and a first associated feature between the first image blocks respectively, includes:
cutting the first characteristic diagram into a plurality of first image blocks;
inputting each first image block into a first inner encoder, and extracting to obtain first internal features of each first image block;
and inputting each first image block into a first outer encoder, and extracting to obtain a first associated characteristic among the first image blocks.
According to some embodiments of the first aspect of the present invention, inputting the second feature map into a second inner encoder and a second outer encoder respectively, and extracting a second inner feature of the second image block in the second feature map and a second associated feature between the second image blocks respectively, includes:
cutting the second characteristic diagram into a plurality of second image blocks;
inputting each second image block into a second inner encoder, and extracting to obtain second internal features of each second image block;
and inputting each second image block into a second outer encoder, and extracting to obtain second associated features among the second image blocks.
According to some embodiments of the first aspect of the present invention, the merging the first internal feature and the first associated feature and inputting the merged feature to the first decoder for decoding to obtain the prediction plane parameter and the prediction plane region includes:
adding elements of the first internal feature and the first associated feature to obtain a first fusion feature;
and inputting the first fusion characteristics into a first decoder to perform decoding classification by taking the plane area and the plane parameters as labels to obtain predicted plane parameters and predicted plane areas.
According to some embodiments of the first aspect of the present invention, the merging the second internal feature and the second associated feature and inputting them to the second decoder for decoding to obtain the predicted non-planar region includes:
adding elements of the second internal feature and the second associated feature to obtain a second fusion feature;
and inputting the second fusion characteristics into a second decoder to perform decoding classification by taking the non-planar area as a label to obtain a predicted non-planar area.
According to some embodiments of the first aspect of the present invention, after fusing the second internal feature and the second associated feature and inputting them to the second decoder for decoding, the method further includes:
the weights of the first decoder are updated based on the predicted planar regions, the predicted non-planar regions and the loss function.
According to some embodiments of the first aspect of the present invention, updating the weights of the first decoder according to the predicted plane region, the predicted non-plane region and the loss function comprises:
updating the weights of the first decoder according to the predicted plane region, the predicted non-plane region and a cross-entropy loss function, wherein the cross-entropy loss function is:

$$L = -\Big( w \sum_{i \in Y_+} \log P_i + \sum_{i \in Y_-} \log\big(1 - P_i\big) \Big)$$

where $Y_+$ and $Y_-$ respectively denote the plane-region labeled pixels and the non-plane-region labeled pixels, $P_i$ denotes the probability that the $i$-th pixel belongs to a planar region, $1 - P_i$ the probability that it belongs to a non-planar region, and $w$ is the ratio of plane-region pixel labels to non-plane-region pixel labels.
An embodiment of a second aspect of the present invention provides an electronic device, including:
a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the monocular three-dimensional plane restoration method according to any one of the embodiments of the first aspect.
The electronic device of the embodiment of the second aspect applies the monocular three-dimensional plane restoration method of any one of the first aspect, and therefore has all the advantages of the first aspect of the present invention.
According to a third aspect of the present invention, there is provided a computer storage medium storing computer-executable instructions for performing the monocular three-dimensional plane restoration method according to any one of the first aspect.
The computer storage medium of the embodiment of the third aspect can execute the monocular three-dimensional plane restoration method of any one of the first aspect, and therefore has all the advantages of the first aspect of the present invention.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a diagram of the main steps of a monocular three-dimensional plane restoration method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the step S100 in the monocular three-dimensional plane restoration method according to the embodiment of the present invention;
fig. 3 is a schematic diagram illustrating a step S200 in the monocular three-dimensional plane restoration method according to the embodiment of the present invention;
FIG. 4 is a schematic diagram of the step S300 in the monocular three-dimensional plane restoration method according to the embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating the step S400 in the monocular three-dimensional plane restoration method according to the embodiment of the present invention;
fig. 6 is a schematic diagram illustrating a step S500 in the monocular three-dimensional plane restoration method according to the embodiment of the present invention;
fig. 7 is a frame diagram of a network to which the monocular three-dimensional plane restoration method according to the embodiment of the present invention is applied.
Detailed Description
In the description of the present invention, unless otherwise explicitly limited, terms such as arrangement, installation and connection should be understood in a broad sense, and those skilled in the art can reasonably determine their specific meanings in the present invention in combination with the specific content of the technical solutions. In the description of the present invention, "several" means one or more and "a plurality" means two or more; "greater than", "less than", "exceeding" and the like are understood as excluding the stated number, while "above", "below", "within" and the like are understood as including it. Furthermore, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless otherwise specified.
With the development of deep learning, the field of computer vision has attracted more and more researchers. Three-dimensional plane recovery and reconstruction is currently one of the mainstream research tasks in computer vision. Three-dimensional plane recovery from a single picture requires segmenting the plane instance regions of a scene in the image domain while estimating the plane parameters of each instance region; generally, non-planar regions are represented by the depth estimated by the network model. The technology has broad application prospects in fields such as virtual reality, augmented reality, and robotics.
Plane detection and recovery from a single picture requires jointly studying image depth, plane normals, plane segmentation, and so on. Traditional three-dimensional plane recovery and reconstruction methods based on hand-crafted feature extraction only extract shallow texture information from the image and rely on geometric priors about planes, so they generalize poorly. Real indoor scenes are very complex: the many shadows and folded occlusions produced by complex lighting degrade the quality of plane recovery and reconstruction, making it difficult for traditional methods to handle plane reconstruction in complex indoor scenes. Plane recovery and reconstruction is an important research direction within three-dimensional reconstruction. Most existing three-dimensional reconstruction methods first generate point cloud data by three-dimensional vision methods, then fit related points to generate a nonlinear scene surface and optimize the overall reconstruction model through global reasoning. Piecewise plane recovery and reconstruction combined with a visual instance segmentation method recognizes the planar regions of a scene and represents each plane with three parameters in a Cartesian coordinate system together with a segmentation mask, achieving better reconstruction accuracy and effect. Piecewise plane recovery and reconstruction is a multi-stage reconstruction method, and the accuracy of plane identification and parameter estimation affects the final model.
Three-dimensional plane restoration requires segmenting the planar regions of a scene in the image domain while estimating the plane parameters of each corresponding region; three-dimensional plane restoration can then be achieved from the planar regions and plane parameters, and a predicted three-dimensional plane is obtained through reconstruction.
There are several plane restoration methods: the end-to-end convolutional neural network architecture PlaneNet can infer a fixed number of plane instance masks and plane parameters from a single RGB image; a fixed number of planes can be predicted by learning directly from depth through losses induced by planar structures; an improved two-stage Mask R-CNN method replaces object class classification with plane geometry prediction and then refines the plane segmentation mask with a convolutional neural network; by predicting pixel-wise plane parameters with an associative embedding method, network parameters are trained to map each pixel into an embedding space, and the embedded pixels are clustered to generate plane instances; a plane refinement method constrained by the Manhattan-world assumption strengthens the refinement of plane parameters by restricting the geometric relationships between plane instances; a divide-and-conquer approach that segments panoramic-image planes separately in the horizontal and vertical directions can effectively recover distorted plane instances despite the difference in pixel distribution between panoramic and ordinary images; and the Transformer-based method PlaneTR effectively improves the efficiency of plane detection by adding the center and edge features of plane instances.
In the related art, monocular three-dimensional plane restoration emphasizes reconstruction precision: the accuracy of the model structure is enhanced by analyzing the edges of planar structures and the embeddings of the scene. However, such methods lack the ability to identify tiny planar regions, so small pixel regions are easily lost during plane detection, which degrades the precision of monocular three-dimensional plane restoration.
Based on this, in order to obtain better results with fewer computing resources, the encoder part of the Transformer can be applied directly to sequences of image blocks for the image classification task, achieving results superior to the most advanced convolutional networks at a lower computational cost. If the object detection problem is described as a sequence-to-sequence prediction problem, a set of objects can be predicted directly from learned object queries that interact with a sequence of context features; this yields a simple new object detection paradigm built on a standard Transformer encoder-decoder architecture that dispenses with many hand-designed components such as anchor generation and non-maximum suppression. To address the suboptimal representation learning caused by convolutional networks' limited learning capacity on low-level feature tensors, semantic segmentation can be redefined as a sequence-to-sequence prediction task with a pure encoder based on a self-attention mechanism, eliminating the dependence on convolution operations and solving the problem of limited receptive fields.
The method, the device and the storage medium for recovering a monocular three-dimensional plane according to the present invention are described below with reference to fig. 1 to 7, and can perform feature extraction on internal features of a feature map, so as to effectively improve the comprehensiveness of feature extraction, and further improve the accuracy of recovering a monocular three-dimensional plane.
Referring to fig. 1, a monocular three-dimensional plane restoration method according to an embodiment of the first aspect of the present invention includes at least some steps:
s100, performing multi-scale feature extraction on an input image to obtain a first feature map and a second feature map under two scales;
s200, inputting the first feature maps into a first inner encoder and a first outer encoder respectively, and extracting first internal features of the first image blocks in the first feature maps and first associated features among the first image blocks respectively;
s300, inputting the second feature maps into a second inner encoder and a second outer encoder respectively, and extracting second internal features of second image blocks in the second feature maps and second associated features among the second image blocks respectively;
s400, fusing the first internal feature and the first associated feature, and inputting the fused first internal feature and the fused first associated feature into a first decoder for decoding to obtain a prediction plane parameter and a prediction plane area;
s500, fusing the second internal feature and the second associated feature, and inputting the fused second internal feature and the second associated feature into a second decoder for decoding to obtain a predicted non-planar area, wherein the predicted non-planar area is used for verifying the predicted planar area;
s600, three-dimensional recovery is carried out according to the plane parameters and the plane area, and a predicted three-dimensional plane is obtained.
Performing multi-scale feature extraction on the input image improves the comprehensiveness of the obtained information. By arranging an inner encoder and an outer encoder to separately extract the internal features of the image blocks in the corresponding feature map and the associated features among the image blocks, and then fusing these features and inputting them into a decoder for decoding, the comprehensiveness of feature extraction is effectively improved, the probability of losing image information is reduced, and the precision of monocular three-dimensional plane restoration is improved.
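For illustration only, the following is a minimal PyTorch-style sketch of the two-branch flow of steps S100 to S600. The module choices (plain convolutional stems standing in for the backbone, generic Transformer layers for the inner and outer encoders, and linear heads for the decoders) are assumptions made to keep the sketch self-contained; they are not the network structure claimed by the invention:

```python
import torch
import torch.nn as nn

class PlaneRecoverySketch(nn.Module):
    """Two-branch skeleton of steps S100-S600 (illustrative assumptions only)."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        # S100: two-scale feature extraction; stride-4 and stride-8 stems
        # stand in for the HW/16 and HW/32 token maps of the description
        self.stem1 = nn.Conv2d(3, dim, kernel_size=4, stride=4)
        self.stem2 = nn.Conv2d(3, dim, kernel_size=8, stride=8)
        # S200/S300: inner encoders (within-block features) and
        # outer encoders (associations between blocks)
        self.inner1 = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.outer1 = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.inner2 = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.outer2 = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        # S400: first decoder -> plane region + plane parameters
        self.plane_region = nn.Linear(dim, 1)
        self.plane_params = nn.Linear(dim, 3)
        # S500: second decoder -> non-plane region (verifies the plane region)
        self.nonplane_region = nn.Linear(dim, 1)

    def forward(self, img):
        t1 = self.stem1(img).flatten(2).transpose(1, 2)   # (B, N1, dim) tokens
        t2 = self.stem2(img).flatten(2).transpose(1, 2)   # (B, N2, dim) tokens
        f1 = self.inner1(t1) + self.outer1(t1)            # S400: element-wise fusion
        f2 = self.inner2(t2) + self.outer2(t2)            # S500: element-wise fusion
        return (self.plane_region(f1),                    # predicted plane region
                self.plane_params(f1),                    # predicted plane parameters
                self.nonplane_region(f2))                 # predicted non-plane region

region, params, nonplane = PlaneRecoverySketch()(torch.randn(1, 3, 192, 256))
```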
It can be understood that, referring to fig. 2, in step S100, performing multi-scale feature extraction on an input image to obtain a first feature map and a second feature map at two scales includes:
s110, performing multi-scale feature extraction on an input image to obtain a first extraction image and a second extraction image under two scales;
and S120, respectively embedding corresponding position information into the first extraction graph and the second extraction graph to obtain a first characteristic graph and a second characteristic graph under two scales.
Specifically, step S110, performing multi-scale feature extraction on an input image through an HRNet convolutional network to obtain a first extraction graph and a second extraction graph under two scales;
and S120, respectively embedding corresponding position information into the first extraction graph and the second extraction graph through position embedding, and respectively converting the position information into tokens to obtain a first feature graph and a second feature graph under two scales.
The first feature map corresponds to a scale of HW/16, and the second feature map corresponds to a scale of HW/32, where H and W represent the height and width of the input image, respectively.
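As an illustrative sketch of the tokenization in steps S110 to S120, the feature map at the HW/16 scale can be flattened into tokens and given a learned position embedding. The learned-embedding choice is an assumption; the description only states that position information is embedded and the maps are converted into tokens:

```python
import torch
import torch.nn as nn

# Flatten the HW/16-scale map into a token sequence and add positions.
B, C, H, W = 1, 64, 192, 256
feat_hw16 = torch.randn(B, C, H // 4, W // 4)           # (H/4)*(W/4) = HW/16 tokens

tokens = feat_hw16.flatten(2).transpose(1, 2)           # (B, HW/16, C) sequence
pos = nn.Parameter(torch.zeros(1, tokens.shape[1], C))  # one learned vector per position
tokens = tokens + pos                                   # first feature map, as tokens
# the HW/32-scale map receives its own embedding in the same way
```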
To obtain more details, the input data is further encoded by an attention mechanism into subdivided patches, i.e., subdivided image blocks. Dividing the feature map into a plurality of disjoint regions and performing Windowed Multi-Head Self-Attention (W-MSA) on the patch embeddings of the different feature maps effectively reduces computation. Tokens from different stages of the vision Transformer are combined into image-like representations at different resolutions and progressively merged into a full-resolution prediction using a convolutional decoder. Compared with a fully convolutional network, the multi-scale dense vision Transformer avoids the feature loss caused by the down-sampling operation after image-block embedding and provides finer and more globally consistent predictions.
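A minimal sketch of W-MSA over disjoint windows follows; the window size and head count are illustrative assumptions:

```python
import torch
import torch.nn as nn

def window_msa(feat: torch.Tensor, window: int = 4, heads: int = 4) -> torch.Tensor:
    """W-MSA sketch: self-attention runs only inside each disjoint window,
    which is what cuts the cost relative to global attention. Creating the
    attention module inside the function is for brevity only."""
    B, C, H, W = feat.shape
    attn = nn.MultiheadAttention(C, heads, batch_first=True)
    # partition into non-overlapping window x window regions
    x = feat.reshape(B, C, H // window, window, W // window, window)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, window * window, C)  # (B*nWin, win*win, C)
    out, _ = attn(x, x, x)                        # attention within each window only
    # merge the windows back into a (B, C, H, W) map
    out = out.reshape(B, H // window, W // window, window, window, C)
    out = out.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
    return out

y = window_msa(torch.randn(1, 32, 16, 16))        # same shape in, same shape out
```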
It can be understood that, referring to fig. 3, step S200, inputting the first feature maps into the first inner encoder and the first outer encoder respectively, and extracting the first inner features of the first image blocks in the first feature maps and the first associated features between the first image blocks respectively, includes:
s210, cutting the first characteristic diagram into a plurality of first image blocks;
s220, inputting each first image block into a first inner encoder, and extracting to obtain first internal features of each first image block;
and S230, inputting each first image block into the first outer encoder, and extracting to obtain a first associated feature among the first image blocks, wherein the first associated feature is used for representing the relationship among the image blocks.
By splitting the first feature map into a plurality of first image blocks, the loss of pixel plane regions with a small proportion can be effectively avoided.
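The following sketch shows one possible (TNT-style) reading of steps S210 to S230, in which the inner encoder attends over the pixels inside each block and the outer encoder attends over one summary token per block. The encoder internals are assumptions, as the description does not specify them:

```python
import torch
import torch.nn as nn

B, C = 1, 64
feat = torch.randn(B, C, 12, 16)                        # first feature map

blocks = feat.reshape(B, C, 3, 4, 4, 4)                 # cut into 3x4 blocks of 4x4 px
blocks = blocks.permute(0, 2, 4, 3, 5, 1).reshape(B * 12, 16, C)  # (blocks, pixels, C)

inner = nn.TransformerEncoderLayer(C, nhead=4, batch_first=True)
internal = inner(blocks)                                # first internal features (S220)

block_tokens = blocks.mean(dim=1).reshape(B, 12, C)     # one summary token per block
outer = nn.TransformerEncoderLayer(C, nhead=4, batch_first=True)
associated = outer(block_tokens)                        # first associated features (S230)
```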
It can be understood that, referring to fig. 4, in step S300, inputting the second feature map into the second inner encoder and the second outer encoder respectively, and extracting the second inner features of the second image block in the second feature map and the second associated features between the second image blocks respectively, including:
s310, cutting the second characteristic diagram into a plurality of second image blocks;
s320, inputting each second image block into a second inner encoder, and extracting to obtain second internal features of each second image block;
and S330, inputting each second image block into a second outer encoder, and extracting to obtain second associated features among the second image blocks, wherein the second associated features are used for representing the relation among the image blocks.
By splitting the second feature map into a plurality of second image blocks, the loss of pixel plane regions with a small proportion can be effectively avoided.
It can be understood that, referring to fig. 5, step S400, fusing the first internal feature and the first associated feature and inputting them to the first decoder for decoding, so as to obtain the prediction plane parameters and the prediction plane region, includes:
s410, performing element addition on the first internal feature and the first associated feature to obtain a first fusion feature;
and S420, inputting the first fusion characteristics into a first decoder to perform decoding classification by using the plane area and the plane parameters as labels to obtain the prediction plane parameters and the prediction plane area.
The comprehensiveness of feature extraction can be effectively improved by fusing the first internal features and the first relevant features, and the accuracy of final three-dimensional plane recovery can be further improved.
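A minimal sketch of the fusion and decoding of steps S410 to S420 follows; the query-based decoder and the head shapes are assumptions used only to make the data flow concrete:

```python
import torch
import torch.nn as nn

internal = torch.randn(1, 192, 64)          # first internal features
associated = torch.randn(1, 192, 64)        # first associated features (same shape)

fused = internal + associated               # S410: element-wise addition

decoder = nn.TransformerDecoderLayer(64, nhead=4, batch_first=True)
queries = torch.zeros(1, 20, 64)            # e.g. 20 plane queries (assumed)
decoded = decoder(queries, fused)           # cross-attend queries to fused tokens

region_head = nn.Linear(64, 192)            # classify which tokens each plane covers
param_head = nn.Linear(64, 3)               # per-plane parameters (normal/offset form)
plane_region, plane_params = region_head(decoded), param_head(decoded)
```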
It can be understood that, referring to fig. 6, in step S500, the step of fusing the second internal feature and the second associated feature and inputting the fused second internal feature and second associated feature to the second decoder for decoding to obtain the predicted non-planar region includes:
s510, performing element addition on the second internal feature and the second associated feature to obtain a second fusion feature;
s520, inputting the second fusion characteristics into a second decoder to perform decoding classification by taking the non-planar area as a label, so as to obtain a prediction non-planar area.
The comprehensiveness of feature extraction can be effectively improved by fusing the second internal features and the second relevant features, and the accuracy of final three-dimensional plane recovery can be further improved.
When the scene changes, the detection precision of three-dimensional plane recovery in the related art is clearly insufficient, and its robustness is low.
Based on this, in order to improve the detection accuracy and robustness for the change of the scene, it can be understood that, in step S500, after the second internal feature and the second associated feature are fused and input to the second decoder for decoding, and a predicted non-planar region is obtained, the method further includes:
the weights of the first decoder are updated based on the predicted planar regions, the predicted non-planar regions and the penalty function.
In the process of three-dimensional plane recovery, iteratively updating the first decoder through the loss function can effectively improve the precision of plane-region prediction, dynamically update the performance of the whole network, and thereby improve detection precision and robustness under scene changes.
It can be understood that the weights of the first decoder are updated according to the predicted planar region, the predicted non-planar region and the penalty function, specifically:
updating the weight of the first decoder according to the prediction plane region, the prediction non-plane region and a cross entropy loss function, wherein the cross entropy loss function is as follows:
$$L = -\Big( w \sum_{i \in Y_+} \log P_i + \sum_{i \in Y_-} \log\big(1 - P_i\big) \Big)$$

where $Y_+$ and $Y_-$ respectively denote the plane-region labeled pixels and the non-plane-region labeled pixels, $P_i$ denotes the probability that the $i$-th pixel belongs to a planar region, $1 - P_i$ the probability that it belongs to a non-planar region, and $w$ is the ratio of plane-region pixel labels to non-plane-region pixel labels. Because the first decoder and the second decoder differ in scale and in the definitions of the positive and negative labels (the positive and negative labels being the plane-region label and the non-plane-region label, respectively), the plane region is further optimized through variational information.
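For illustration, the weighted cross-entropy above can be computed as in the following sketch; the label-ratio weight w follows the definition given above, and the absence of any extra normalization simply mirrors the formula as written:

```python
import torch

def plane_cross_entropy(p, plane_mask, eps=1e-6):
    """Weighted cross-entropy of the form given above.
    p          : (N,) predicted probability that each pixel is planar
    plane_mask : (N,) bool, True for Y+ (plane-labeled) pixels, False for Y-
    """
    w = plane_mask.sum() / (~plane_mask).sum().clamp(min=1)  # plane : non-plane label ratio
    pos = torch.log(p[plane_mask] + eps).sum()               # sum over Y+
    neg = torch.log(1 - p[~plane_mask] + eps).sum()          # sum over Y-
    return -(w * pos + neg)

loss = plane_cross_entropy(torch.rand(1000), torch.rand(1000) > 0.4)
```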
Mutual information is a measure, based on Shannon entropy, of the degree of dependence between two random variables, and can capture nonlinear statistical correlations between them. The mutual information between X and Z can be understood as the reduction in the uncertainty of X once Z is given:

$$I(X; Z) = H(X) - H(X \mid Z)$$

where $H(X)$ is the Shannon entropy and $H(X \mid Z)$ is the conditional entropy of $X$ given $Z$; $P_{XZ}$ is the joint probability distribution of the two variables, and $P_X$ and $P_Z$ are the respective marginal distributions. Equivalently, mutual information is the Kullback-Leibler (KL) divergence between $P_{XZ}$ and the product $P_X \otimes P_Z$:

$$I(X; Z) = D_{\mathrm{KL}}\big(P_{XZ} \,\|\, P_X \otimes P_Z\big)$$

The greater the divergence between the joint probability $P_{XZ}$ and the product of the marginals $P_X \otimes P_Z$, the stronger the dependence between X and Z; for two completely independent variables the mutual information is zero. Mutual information is typically used in unsupervised representation learning networks, but it is difficult to estimate as a bijective function and may yield a suboptimal representation that is independent of the downstream task. While a highly nonlinear evaluation framework may lead to better downstream performance, it defeats the purpose of learning an efficient, transferable data representation. A knowledge distillation framework based on mutual information defines the mutual information as the difference between the entropy of the teacher model and the conditional entropy of the teacher model given the student model, and lets the student model learn the feature distribution of the teacher model by maximizing the mutual information between the teacher-student networks.
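The two definitions of mutual information above can be checked numerically on a toy joint distribution; the example values are arbitrary:

```python
import numpy as np

# I(X;Z) = H(X) - H(X|Z) = D_KL(P_XZ || P_X x P_Z) on a 2x2 joint.
P = np.array([[0.3, 0.1],
              [0.1, 0.5]])                        # P_XZ, rows = x, cols = z
Px, Pz = P.sum(axis=1), P.sum(axis=0)             # marginals P_X, P_Z

H_X = -np.sum(Px * np.log(Px))                    # Shannon entropy H(X)
H_X_given_Z = -np.sum(P * np.log(P / Pz))         # conditional entropy H(X|Z)
mi_entropy = H_X - H_X_given_Z

mi_kl = np.sum(P * np.log(P / np.outer(Px, Pz)))  # KL(P_XZ || P_X (x) P_Z)
assert np.isclose(mi_entropy, mi_kl)              # both give ~0.178 nats
```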
Based on the above, the invention enhances feature expression by maximizing the mutual information between the planar features of the two scale network branches. In the PlaneMT network model framework, two network branches with different scales, corresponding to the first decoder and the second decoder respectively, detect the predicted plane region $S_P$ and the predicted non-plane region $S'_{N\text{-}P}$. In the ideal case, the predicted plane region and the predicted non-plane region are exactly inverted:

$$S'_P := \neg S'_{N\text{-}P}$$

Thus the plane-region variables $S_P$ and $S'_P$ output by the two network branches serve as a variational information measure for information maximization:

$$\max \; I(S_P; S'_P)$$
since mutual information is difficult to calculate, a lower bound of variation is proposed for each mutual information item I (X; Z), p (X | Z) is modeled with a variable Gaussian q (X | Z):
I(S P ;S′ P )=H(S P )-H(S P |S′ P )
Figure BDA0003717326180000091
the last inequality shows KL divergence D KL Is not negative.
Referring to fig. 7, which shows the framework of a network applying the monocular three-dimensional plane restoration method of the embodiment of the present invention: after the backbone network extracts features to obtain feature maps of size 12 × 16 and 6 × 8, the 12 × 16 feature map is fed through position embedding (POS) into the first inner and outer encoders, and the 6 × 8 feature map is fed through POS into the second inner and outer encoders, with the mutual information loss serving as the loss function.
In addition, an embodiment of the second aspect of the present invention further provides an electronic device, where the electronic device includes: a memory, a processor, and a computer program stored on the memory and executable on the processor.
The processor and memory may be connected by a bus or other means.
The memory, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer-executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs and instructions required to implement the monocular three-dimensional plane restoration method of the above-described first aspect embodiment are stored in the memory, and when executed by the processor, perform the monocular three-dimensional plane restoration method of the above-described embodiment, for example, perform the above-described method steps S100 to S600, method steps S110 to S120, method steps S210 to S230, method steps S310 to S330, method steps S410 to S420, and method steps S510 to S520.
The above described embodiments of the device are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may also be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Furthermore, an embodiment of the third aspect of the present invention provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor or controller (for example, the processor in the apparatus embodiment above), cause the processor to perform the monocular three-dimensional plane restoration method of the above embodiments, for example, the above-described method steps S100 to S600, S110 to S120, S210 to S230, S310 to S330, S410 to S420, and S510 to S520.
One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an illustrative embodiment," "an example," "a specific example," or "some examples" or the like mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (10)

1. A monocular three-dimensional plane restoration method is characterized by comprising the following steps:
performing multi-scale feature extraction on the input image to obtain a first feature map and a second feature map under two scales;
inputting the first feature map into a first inner encoder and a first outer encoder respectively, and extracting first internal features of first image blocks in the first feature map and first associated features among the first image blocks respectively;
inputting the second feature map into a second inner encoder and a second outer encoder respectively, and extracting second internal features of second image blocks in the second feature map and second associated features among the second image blocks respectively;
fusing the first internal feature and the first associated feature and inputting the fused first internal feature and the fused first associated feature into a first decoder for decoding to obtain a prediction plane parameter and a prediction plane area;
fusing the second internal feature and the second associated feature and inputting the fused second internal feature and the fused second associated feature into a second decoder for decoding to obtain a predicted non-planar area, wherein the predicted non-planar area is used for checking the predicted planar area;
and performing three-dimensional restoration according to the plane parameters and the plane area to obtain a predicted three-dimensional plane.
2. The monocular three-dimensional plane restoration method according to claim 1, wherein the performing multi-scale feature extraction on the input image to obtain a first feature map and a second feature map at two scales comprises:
performing multi-scale feature extraction on an input image to obtain a first extraction image and a second extraction image under two scales;
and embedding corresponding position information into the first extraction graph and the second extraction graph respectively to obtain a first feature graph and a second feature graph under two scales.
3. The monocular three-dimensional plane restoration method according to claim 1, wherein the inputting the first feature map into a first inner encoder and a first outer encoder respectively, and extracting a first inner feature of a first image block in the first feature map and a first associated feature between each of the first image blocks respectively comprises:
cutting the first characteristic diagram into a plurality of first image blocks;
inputting each first image block into the first inner encoder, and extracting to obtain the first internal features of each first image block;
and inputting each first image block into the first outer encoder, and extracting to obtain the first associated features among the first image blocks.
4. The monocular three-dimensional plane restoration method according to claim 1, wherein the step of inputting the second feature map into a second inner encoder and a second outer encoder respectively, and obtaining second internal features of the second image blocks in the second feature map and second associated features between the second image blocks respectively comprises:
cutting the second characteristic diagram into a plurality of second image blocks;
inputting each second image block into the second inner encoder, and extracting to obtain the second internal features of each second image block;
and inputting each second image block into the second outer encoder, and extracting to obtain the second associated features among the second image blocks.
5. The monocular three-dimensional plane restoration method according to claim 1, wherein the step of fusing the first internal feature and the first associated feature and inputting the fused first internal feature and the fused first associated feature to a first decoder for decoding to obtain a prediction plane parameter and a prediction plane region comprises:
performing element addition on the first internal feature and the first associated feature to obtain a first fusion feature;
and inputting the first fusion characteristic into the first decoder to perform decoding classification by taking a plane area and a plane parameter as labels to obtain the predicted plane parameter and the predicted plane area.
6. The monocular three-dimensional plane restoration method according to claim 1, wherein the fusing the second internal feature and the second associated feature and inputting the fused second internal feature and the fused second associated feature to a second decoder for decoding to obtain a predicted non-planar region comprises:
performing element addition on the second internal feature and the second correlation feature to obtain a second fusion feature;
and inputting the second fusion feature into the second decoder to perform decoding classification by taking the non-planar area as a label to obtain the predicted non-planar area.
7. The monocular three-dimensional plane restoration method according to claim 1, wherein after the fusing the second internal feature and the second associated feature and inputting them to a second decoder for decoding to obtain a predicted non-planar region, further comprising:
updating the weights of the first decoder according to the predicted planar region, the predicted non-planar region and a loss function.
8. The monocular three-dimensional plane restoration method according to claim 7, wherein the updating the weights of the first decoder according to the predicted planar region, the predicted non-planar region and the loss function comprises:
updating the weight of the first decoder according to the prediction plane region, the prediction non-plane region and a cross entropy loss function, wherein the cross entropy loss function is as follows:
$$L = -\Big( w \sum_{i \in Y_+} \log P_i + \sum_{i \in Y_-} \log\big(1 - P_i\big) \Big)$$

where $Y_+$ and $Y_-$ respectively denote the plane-region labeled pixels and the non-plane-region labeled pixels, $P_i$ denotes the probability that the $i$-th pixel belongs to a planar region, $1 - P_i$ the probability that it belongs to a non-planar region, and $w$ is the ratio of plane-region pixel labels to non-plane-region pixel labels.
9. An electronic device, comprising:
a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the monocular three-dimensional plane restoration method according to any one of claims 1 to 8 when executing the computer program.
10. A computer storage medium having stored thereon computer-executable instructions for performing the monocular three-dimensional plane restoration method of any one of claims 1 to 8.
CN202210739676.7A 2022-06-28 2022-06-28 Monocular three-dimensional plane recovery method, equipment and storage medium Pending CN115115691A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210739676.7A CN115115691A (en) 2022-06-28 2022-06-28 Monocular three-dimensional plane recovery method, equipment and storage medium
PCT/CN2022/110039 WO2024000728A1 (en) 2022-06-28 2022-08-03 Monocular three-dimensional plane recovery method, device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210739676.7A CN115115691A (en) 2022-06-28 2022-06-28 Monocular three-dimensional plane recovery method, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115115691A true CN115115691A (en) 2022-09-27

Family

ID=83330200

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210739676.7A Pending CN115115691A (en) 2022-06-28 2022-06-28 Monocular three-dimensional plane recovery method, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN115115691A (en)
WO (1) WO2024000728A1 (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11328476B2 (en) * 2019-11-14 2022-05-10 Qualcomm Incorporated Layout estimation using planes
CN111414923B (en) * 2020-03-05 2022-07-12 南昌航空大学 Indoor scene three-dimensional reconstruction method and system based on single RGB image
CN112001960B (en) * 2020-08-25 2022-09-30 中国人民解放军91550部队 Monocular image depth estimation method based on multi-scale residual error pyramid attention network model
CN112990299B (en) * 2021-03-11 2023-10-17 五邑大学 Depth map acquisition method based on multi-scale features, electronic equipment and storage medium
CN113850900B (en) * 2021-05-27 2024-06-21 北京大学 Method and system for recovering depth map based on image and geometric clues in three-dimensional reconstruction
CN113610912B (en) * 2021-08-13 2024-02-02 中国矿业大学 System and method for estimating monocular depth of low-resolution image in three-dimensional scene reconstruction

Also Published As

Publication number Publication date
WO2024000728A1 (en) 2024-01-04

Similar Documents

Publication Publication Date Title
US11176381B2 (en) Video object segmentation by reference-guided mask propagation
CN111369581B (en) Image processing method, device, equipment and storage medium
US20200074178A1 (en) Method and system for facilitating recognition of vehicle parts based on a neural network
US20180114071A1 (en) Method for analysing media content
CN110287826B (en) Video target detection method based on attention mechanism
CN110163188B (en) Video processing and method, device and equipment for embedding target object in video
CN112418216A (en) Method for detecting characters in complex natural scene image
CN111652181B (en) Target tracking method and device and electronic equipment
CN113903022B (en) Text detection method and system based on feature pyramid and attention fusion
CN109492576A (en) Image-recognizing method, device and electronic equipment
CN112052845A (en) Image recognition method, device, equipment and storage medium
CN106407978B (en) Method for detecting salient object in unconstrained video by combining similarity degree
CN115063786A (en) High-order distant view fuzzy license plate detection method
CN111626295A (en) Training method and device for license plate detection model
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN111767854B (en) SLAM loop detection method combined with scene text semantic information
CN111104941B (en) Image direction correction method and device and electronic equipment
CN114170558A (en) Method, system, device, medium and article for video processing
Shit et al. An encoder‐decoder based CNN architecture using end to end dehaze and detection network for proper image visualization and detection
CN113704276A (en) Map updating method and device, electronic equipment and computer readable storage medium
KR102026280B1 (en) Method and system for scene text detection using deep learning
Wang et al. Extraction of main urban roads from high resolution satellite images by machine learning
CN114783042A (en) Face recognition method, device, equipment and storage medium based on multiple moving targets
CN115115691A (en) Monocular three-dimensional plane recovery method, equipment and storage medium
CN114387496A (en) Target detection method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination