CN117274351A - Semantic-containing three-dimensional reconstruction method for multi-scale feature pyramid - Google Patents

Semantic-containing three-dimensional reconstruction method for multi-scale feature pyramid

Info

Publication number
CN117274351A
CN117274351A (application CN202311448872.XA)
Authority
CN
China
Prior art keywords
image
scale
pyramid
semantic
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311448872.XA
Other languages
Chinese (zh)
Inventor
谭鑫
纪宇舟
谢源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN202311448872.XA priority Critical patent/CN117274351A/en
Publication of CN117274351A publication Critical patent/CN117274351A/en
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/593Depth or shape recovery from multiple images from stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a semantic three-dimensional reconstruction method based on a multi-scale feature pyramid, characterized by comprising the following steps: 1) extract continuous two-dimensional semantic features with a two-dimensional panoptic segmentation model, then rectangularly crop and zero-pad the different features to obtain feature blocks; 2) construct a basic multi-scale image pyramid by uniformly tiling the original images, then reconstruct an optimized pyramid whose tiles are replaced by feature blocks of a matching neighbouring scale; 3) train a language-image-encoding neural radiance field on top of a three-dimensional reconstruction neural radiance field, and perform open-vocabulary semantic queries on it after training. Compared with the prior art, the method addresses the limitations of many semantics-aware three-dimensional reconstruction methods, which rely on a few fixed label words, cannot understand abstract semantics, and are difficult to query in real time. It provides a more effective and practical route to broad semantic interaction with three-dimensional reconstruction results, enabling environment perception and interaction for smart homes, industrial robots, and autonomous-driving scenes.

Description

Semantic-containing three-dimensional reconstruction method for multi-scale feature pyramid
Technical Field
The invention relates to the technical fields of three-dimensional reconstruction and cross-modal learning, and in particular to a semantic three-dimensional reconstruction method based on a contrastive language-image pre-training model and a multi-scale feature pyramid.
Background
Three-dimensional reconstruction based on neural radiance fields recovers complete three-dimensional scene content from a series of RGB images; combined with different post-processing steps, it can directly export surface meshes, models, and the like. Three-dimensional scene representations can therefore be obtained without professional capture equipment, enabling fast and simple modelling in fields such as the film industry and digital twins. Other types of three-dimensional reconstruction are also widely used in robotic environment perception and automatic path planning.
A modality is a form in which data exists, such as text, audio, image, or video. Cross-modal techniques map the content of different modalities into a unified space, or enrich the current modality with data from other modalities, and are important techniques for interaction and knowledge transfer. They are widely used in applications such as text-to-image generation, image understanding, and semantic queries. Owing to the considerable application prospects of three-dimensional reconstruction, the related techniques have developed rapidly in recent years, with very large improvements in reconstruction accuracy and speed.
Prior-art reconstruction methods produce results that contain no semantic information and therefore cannot be applied directly to interaction. The few existing reconstruction techniques that do carry semantics are, in pursuit of segmentation quality, almost always limited to fixed segmentation labels and cannot perform open-vocabulary semantic queries or abstract semantic understanding. As a result, their reconstructions are also hard to apply to intelligent environmental semantic perception in fields such as robotics and autonomous driving.
Disclosure of Invention
The invention aims to provide a semantic three-dimensional reconstruction method based on a multi-scale feature pyramid that addresses the defects of the prior art. It adopts a contrastive language-image pre-training model for cross-modal knowledge transfer, and uses a multi-scale feature pyramid to attend accurately to objects of different sizes in a scene. On top of an existing three-dimensional reconstruction method, and with almost no noticeable overhead, it directly enables large-scale semantic understanding of wide scenes and real-time query interaction with free-form words. The method is simple and convenient, works well, has clear modules that are easy to upgrade and optimize, and is easy to deploy: it can run directly on some laptops, can be applied effectively to intelligent environment perception in fields such as robotics and autonomous driving, and has good application prospects.
The specific technical scheme for realizing the aim of the invention is as follows: the semantic three-dimensional reconstruction method for the multi-scale feature pyramid is characterized by comprising the following steps of:
step 1: two-dimensional semantic feature extraction
1-1: for an input RGB picture sequence, semantic segmentation is carried out by using any two-dimensional panoramic segmentation model to obtain knots
Fruit picture sequence seg_imgs.
1-2: and (3) according to picture parameters in each result picture sequence seg_imgs, rectangular cutting is carried out on the original picture, all continuous features are cut from (top, left) to (bottom, right) of the features, the length and the width are zero-filled to max (top-bottom and left-right), and finally the original features are moved to the center of the filling picture to obtain a square feature block set seg_Tiles.
Step 2: building optimized multi-scale image pyramids
2-1: the original picture sequence is subjected to multi-scale cutting, and the cutting proportion is set as S, if S=0.025, each picture is uniformly cut into a picture block sequence with the side length being 0.025 times of the side length of the original picture (zero filling is carried out on the edge), and the cutting proportion is set as S min =0.05 to S max 7 proportions are uniformly constructed between 0.5, so that a multi-scale picture pyramid of 7 layers is obtained, and each layer contains tiles uniformly divided by different proportions.
2-2: traversing the blocks of each layer, finding the feature in the seg_imgs where the block center point is located, and comparing the seg_tile scale of the feature blocks divided by the feature blocks with the feature blocksWhether the scale of the front layer is matched or not, and the scale of the front layer is set as S i Belonging to S min To S max Between the upper layer and the lower layer, the scale is S i-1 And S is i+1 The feature block scale is S x If and only if S i−1 ∗1.1 < S x ≤ S i+1 ∗ 1.1.1 or S x ≥ S max At the moment, the feature Image blocks are used for replacing uniformly divided Image blocks in the original Pyramid, and the optimized multi-scale Image Pyramid Multiscale construction is completed after the matching and replacing operation is carried out on all the Image blocks.
Step 3: language-image coding neural radiation field training and prediction
3-1: and (3) inputting the reconstructed Multiscale Image Pyramid Multiscale Image-Pyramid into a CLIP or Openclip contrast language-Image pre-training model, and coding the Image to a space consistent with the text to obtain a Multiscale Feature coding Pyramid Multiscale Feature-Pyramid.
3-2: the Multiscale Feature encoding Pyramid Multiscale feature_feature_pyramid is used as a reference value (GT), and a NeRF three-dimensional reconstruction model is utilized to train a language-image encoding nerve radiation Field CLIP_Field in an increasing mode.
3-3: when the query is carried out after the training is finished, the query text is put into the same contrast language-image pre-training model to obtain text codes, the code vectors and the predicted value of the language-image coding neural radiation Field CLIP_Field under the query view are subjected to dot products to obtain vector similarity, and therefore a group of points with the highest similarity are points of objects corresponding to the text query in the scene.
Compared with the prior art, the invention has the following beneficial technical effects and notable technical advances:
1) For the segmented result, the invention uses only the continuous semantic information of the image rather than segmentation-label information, which effectively ensures that semantic queries on the result are not limited to segmentation labels.
2) In the optimized multi-scale image pyramid constructed by the invention, the tiles of each level are partially rebuilt from matching feature blocks, so tile scales within a level are not fully uniform but adapted to object scale, achieving accurate attention to scene objects of different sizes.
3) The invention uses a conventional three-dimensional reconstruction model to train a feature-encoding radiance field from the multi-scale feature-encoding pyramid produced by the contrastive language-image pre-training model, finally obtaining pixel-level semantic-similarity information for the scene on the radiance field.
4) The method solves the problems that many three-dimensional reconstruction methods cannot obtain scene semantic information, obtain only single and fuzzy semantic information, or cannot perform large-scale semantic interaction and abstract semantic understanding in real time.
5) The method works well, is simple to implement, has clear modules that are convenient to upgrade and optimize, is easy to deploy (it can run directly on some laptops), and can be applied effectively to intelligent environment perception in fields such as robotics and autonomous driving.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow chart for reconstructing a multi-scale image pyramid;
FIG. 3 is a flow chart of model prediction.
Description of the embodiments
The present invention will be described in detail below with reference to the accompanying drawings and examples for the purpose of facilitating understanding of the present invention.
Examples
Referring to fig. 1, semantic three-dimensional reconstruction based on a contrastive language-image pre-training model and a multi-scale feature pyramid proceeds as follows:
s100: and inputting the input RGB picture sequence meeting the reconstruction standard into a panoramic segmentation model for point label prediction to obtain the characteristic label information of all the pixel points.
S110: and carrying out multi-scale uniform cutting and edge zero filling on the input image to obtain a basic multi-scale image pyramid.
S120: and (3) utilizing the label information of the image pixel points in the S100 to perform cutting, zero filling and center alignment on the continuous image semantic blocks of the same label.
S130: and (3) reconstructing an optimized multi-scale image pyramid by combining the results of the step S110 and the step S120, and inputting a contrast language-image pre-training model to obtain the multi-scale feature coding pyramid.
Referring to fig. 2, an optimized multi-scale image pyramid is constructed as follows:
s200: traversing the image blocks of each layer, finding out the characteristics in the Seg_Imgs of the center point of the image block, comparing whether the Seg_Tile scale of the characteristic image block divided by the characteristics is matched with the scale of the current layer, and setting the scale of the current layer as S i Belonging to S min To S max Between the upper layer and the lower layer, the scale is S i-1 And S is i+1 The feature block scale is S x If and only if S i−1 ∗1.1 < S x ≤ S i+1 ∗ 1.1.1 or S x ≥ S max The matching is called as matching, at the moment, the characteristic image blocks are used for replacing the uniformly divided image blocks in the original pyramid, otherwise, the image blocks are not replaced.
S210: and after matching and replacing all the tiles, the optimized multi-scale Image Pyramid Multiscale Image-Pyramid construction is completed.
S220: putting the multi-scale image pyramid into a CLIP or OpenCLIP contrast language-image pre-training model, coding the image to a space consistent with the text to obtain a multi-scale feature coding pyramid, wherein the pyramid is used as a reference value of semantic content for three-dimensional reconstruction training in the step S140.
Referring to FIG. 3, model prediction of the present invention (the semantic query part of FIG. 1) includes the following steps:
s300: putting the text of the query into the same CLIP or OpenCLIP contrast language-image pre-training model, and encoding the text into a vector space consistent with the model output.
S310: and for each point, performing dot product on the text coding vector and the image coding vector output by the model to obtain the pixel degree of each point and the query text.
S320: and outputting a similarity graph with aligned points as a model prediction query result, wherein the graph can be visualized by superposing a color table after normalization.
The present invention is not limited to the above embodiments; variations and improvements that those skilled in the art can conceive without departing from the spirit and scope of the inventive concept are included in the present invention, whose protection scope is defined by the appended claims.

Claims (4)

1. A semantic three-dimensional reconstruction method for a multi-scale feature pyramid, characterized by comprising the following steps:
step 1: two-dimensional semantic feature extraction
1-1: semantic segmentation is carried out on the input RGB picture sequence by using a two-dimensional panoramic segmentation model, and a result picture sequence seg_Imgs is obtained;
1-2: according to picture parameters in each result picture sequence seg_imgs, rectangular cutting all continuous features of the original picture, cutting from top, left to bottom, right of the features, and filling length and width zero into max|top-
The original features are moved to the center of the filling diagram to obtain a square feature block set seg_tiles;
step 2: building optimized multi-scale image pyramids
2-1: dividing the original image sequence into multiple scales, setting the dividing ratio as S, uniformly dividing the side length of each image into image block sequences with the side length of the original image being S times, and dividing the dividing ratio from S min =0.05 to S max Uniformly constructing 7 proportions among 0.5 to obtain a 7-layer multi-scale picture pyramid, wherein each layer contains tiles uniformly divided by different proportions;
2-2: traversing the image blocks of each layer, finding out the characteristics in the Seg_Imgs of the center point of the image block, comparing whether the Seg_Tile scale of the characteristic image block divided by the characteristics is matched with the scale of the current layer, and setting the scale of the current layer as S i Belonging to S min To S max Between the upper layer and the lower layer, the scale is S i-1 And S is i+1 The feature block scale is S x When S i−1 ∗1.1 < S x ≤ S i+1 ∗ 1.1.1, or S x ≥ S max When the image pyramid is called matching, the feature image blocks are used for replacing the divided image blocks in the original pyramid, and matching and replacing operations are carried out on all the image blocks to complete the optimized multi-scale image pyramidMultiscale Image Pyramid
Constructing;
step 3: language-image coding neural radiation field training and prediction
3-1: putting the reconstructed Multiscale Image Pyramid into a CLIP or Openclip contrast language-Image pre-training model to enable the coded Image and the text to be consistent in space, and obtaining a Multiscale Feature coded Pyramid Multiscale Feature Pyramid;
3-2: training one language-image coding nerve radiation Field CLIP_Field by using a NeRF three-dimensional reconstruction model by taking a Multiscale Feature coding Pyramid multiscale_feature_pyramid as a reference value GT;
3-3: after training is completed, the query can be carried out, the query text is put into the same contrast language-image pre-training model to obtain text codes, the code vectors and the language-image code neural radiation Field CLIP_Field predicted values under the query view are subjected to dot product according to the dots to obtain vector similarity, and a group of dots with the highest vector similarity are used as the dots of the object corresponding to the text query in the scene.
2. The semantic three-dimensional reconstruction method for a multi-scale feature pyramid according to claim 1, characterized in that in step 1-1, two-dimensional features are extracted with a two-dimensional panoptic segmentation model, and the feature blocks are processed by semantic segmentation to obtain continuous semantic information of the image.
3. The semantic three-dimensional reconstruction method according to claim 1, characterized in that in step 2-2, when constructing the optimized multi-scale image pyramid, the tiles of each level are partially rebuilt with the matching feature blocks, so that tile scales within a level are not fully uniform but adapted to object scale, achieving accurate attention to scene objects of different sizes.
4. The semantic three-dimensional reconstruction method according to claim 1, characterized in that in step 3-2, the multi-scale feature-encoding pyramid obtained by encoding with the contrastive language-image pre-training model is used to train a feature-encoding radiance field with a NeRF three-dimensional reconstruction model, obtaining pixel-level semantic-similarity information for the scene on the radiance field.
CN202311448872.XA 2023-11-02 2023-11-02 Semantic-containing three-dimensional reconstruction method for multi-scale feature pyramid Pending CN117274351A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311448872.XA CN117274351A (en) 2023-11-02 2023-11-02 Semantic-containing three-dimensional reconstruction method for multi-scale feature pyramid

Publications (1)

Publication Number Publication Date
CN117274351A 2023-12-22

Family

ID=89208204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311448872.XA Pending CN117274351A (en) 2023-11-02 2023-11-02 Semantic-containing three-dimensional reconstruction method for multi-scale feature pyramid

Country Status (1)

Country Link
CN (1) CN117274351A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325534A (en) * 2018-09-22 2019-02-12 天津大学 A kind of semantic segmentation method based on two-way multi-Scale Pyramid
WO2019127102A1 (en) * 2017-12-27 2019-07-04 深圳前海达闼云端智能科技有限公司 Information processing method and apparatus, cloud processing device, and computer program product
CN113345082A (en) * 2021-06-24 2021-09-03 云南大学 Characteristic pyramid multi-view three-dimensional reconstruction method and system
WO2022016311A1 (en) * 2020-07-20 2022-01-27 深圳元戎启行科技有限公司 Point cloud-based three-dimensional reconstruction method and apparatus, and computer device
CN114693930A (en) * 2022-03-31 2022-07-01 福州大学 Example segmentation method and system based on multi-scale features and context attention
CN115393410A (en) * 2022-07-18 2022-11-25 华东师范大学 Monocular view depth estimation method based on nerve radiation field and semantic segmentation
CN115775316A (en) * 2022-11-23 2023-03-10 长春理工大学 Image semantic segmentation method based on multi-scale attention mechanism
CN116310098A (en) * 2023-03-01 2023-06-23 电子科技大学 Multi-view three-dimensional reconstruction method based on attention mechanism and variable convolution depth network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Multi-object monocular SLAM based on instance segmentation and three-dimensional reconstruction in dynamic scenes", Chinese Journal of Scientific Instrument, vol. 44, no. 8, 15 August 2023 (2023-08-15) *
Cheng Mingyang; Gai Shaoyan; Da Feipeng: "Research on stereo matching networks based on attention mechanism", Acta Optica Sinica, no. 14, 27 May 2020 (2020-05-27) *

Similar Documents

Publication Publication Date Title
Li et al. Megadepth: Learning single-view depth prediction from internet photos
Kumar et al. Colorization transformer
CN109190752B (en) Image semantic segmentation method based on global features and local features of deep learning
CN111259936B (en) Image semantic segmentation method and system based on single pixel annotation
CN114255238A (en) Three-dimensional point cloud scene segmentation method and system fusing image features
CN110796143A (en) Scene text recognition method based on man-machine cooperation
CN110827312B (en) Learning method based on cooperative visual attention neural network
CN114943876A (en) Cloud and cloud shadow detection method and device for multi-level semantic fusion and storage medium
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN112070137B (en) Training data set generation method, target object detection method and related equipment
CN112084859A (en) Building segmentation method based on dense boundary block and attention mechanism
CN114708297A (en) Video target tracking method and device
CN112700476A (en) Infrared ship video tracking method based on convolutional neural network
CN117274883A (en) Target tracking method and system based on multi-head attention optimization feature fusion network
Blomqvist et al. Baking in the feature: Accelerating volumetric segmentation by rendering feature maps
Pei MSFNet: Multi-scale features network for monocular depth estimation
CN109886996B (en) Visual tracking optimization method
CN117274351A (en) Semantic-containing three-dimensional reconstruction method for multi-scale feature pyramid
CN114022371B (en) Defogging device and defogging method based on space and channel attention residual error network
Si et al. Image semantic segmentation based on improved DeepLab V3 model
CN117994525A (en) Point cloud semi-supervised panorama segmentation method based on mixed enhancement and instance information learning
LAIa et al. Immovable Cultural Relics Preservation Through 3D Reconstruction Using NeRF
He et al. Adaptive Voxelization Strategy for 3D Object Detection
CN117557998A (en) Change detection labeling method for remote sensing image
CN115761438A (en) Depth estimation-based saliency target detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination