CN117274351A - Semantic-containing three-dimensional reconstruction method for multi-scale feature pyramid - Google Patents

Semantic-containing three-dimensional reconstruction method for multi-scale feature pyramid

Info

Publication number
CN117274351A
CN117274351A (application CN202311448872.XA)
Authority
CN
China
Prior art keywords
image
scale
pyramid
semantic
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311448872.XA
Other languages
Chinese (zh)
Inventor
谭鑫
纪宇舟
谢源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN202311448872.XA priority Critical patent/CN117274351A/en
Publication of CN117274351A publication Critical patent/CN117274351A/en
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/593Depth or shape recovery from multiple images from stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a semantic three-dimensional reconstruction method based on a multi-scale feature pyramid, characterized by comprising the following steps: 1) extract continuous two-dimensional semantic features with a two-dimensional panoptic segmentation model, then rectangularly crop and zero-pad the different features to obtain feature blocks; 2) construct a basic multi-scale image pyramid by uniformly tiling the original images, then reconstruct an optimized pyramid whose tiles are replaced by feature blocks of a matching neighbouring scale; 3) train a language-image-encoding neural radiance field on top of a three-dimensional reconstruction neural radiance field, and perform open-vocabulary semantic queries on it after training. Compared with the prior art, the method addresses the limitations of many semantics-aware three-dimensional reconstruction methods, which rely on a few fixed label words, cannot understand abstract semantics, and are difficult to query in real time. It provides a more effective and practical route to broad semantic interaction with three-dimensional reconstruction results, enabling environment perception and interaction for smart homes, industrial robots, and autonomous-driving scenes.

Description

Semantic-containing three-dimensional reconstruction method for multi-scale feature pyramid
Technical Field
The invention relates to the technical fields of three-dimensional reconstruction and cross-modal learning, and in particular to a semantic three-dimensional reconstruction method based on a contrastive language-image pre-training model and a multi-scale feature pyramid.
Background
Three-dimensional reconstruction based on neural radiance fields recovers complete three-dimensional scene content from a series of RGB images; combined with different post-processing steps, it can directly export surface meshes, models, and the like. Three-dimensional scene representations can therefore be obtained without professional capture equipment, enabling fast and simple modelling in fields such as the film industry and digital twins. Other types of three-dimensional reconstruction are also widely used in robotic environment perception and automatic path planning.
A modality is a form in which data exists, such as text, audio, image, or video. Cross-modal techniques map the content of different modalities into a unified space, or enrich the current modality with data from other modalities, and are important techniques for interaction and knowledge transfer. They are widely used in applications such as text-to-image generation, image understanding, and semantic queries. Owing to the considerable application prospects of three-dimensional reconstruction, the related techniques have developed rapidly in recent years, with very large improvements in reconstruction accuracy and speed.
Prior-art reconstruction methods produce results that contain no semantic information and therefore cannot be applied directly to interaction. The few existing reconstruction techniques that do carry semantics are, in pursuit of segmentation quality, almost always limited to fixed segmentation labels and cannot perform open-vocabulary semantic queries or abstract semantic understanding. As a result, their reconstructions are also hard to apply to intelligent environmental semantic perception in fields such as robotics and autonomous driving.
Disclosure of Invention
The invention aims to provide a semantic three-dimensional reconstruction method based on a multi-scale feature pyramid that addresses the defects of the prior art. It adopts a contrastive language-image pre-training model for cross-modal knowledge transfer, and uses a multi-scale feature pyramid to attend accurately to objects of different sizes in a scene. On top of an existing three-dimensional reconstruction method, and with almost no noticeable overhead, it directly enables large-scale semantic understanding of wide scenes and real-time query interaction with free-form words. The method is simple and convenient, works well, has clear modules that are easy to upgrade and optimize, and is easy to deploy: it can run directly on some laptops, can be applied effectively to intelligent environment perception in fields such as robotics and autonomous driving, and has good application prospects.
The specific technical scheme for realizing the aim of the invention is as follows: the semantic three-dimensional reconstruction method for the multi-scale feature pyramid is characterized by comprising the following steps of:
step 1: two-dimensional semantic feature extraction
1-1: for an input RGB picture sequence, semantic segmentation is carried out by using any two-dimensional panoramic segmentation model to obtain knots
Fruit picture sequence seg_imgs.
1-2: and (3) according to picture parameters in each result picture sequence seg_imgs, rectangular cutting is carried out on the original picture, all continuous features are cut from (top, left) to (bottom, right) of the features, the length and the width are zero-filled to max (top-bottom and left-right), and finally the original features are moved to the center of the filling picture to obtain a square feature block set seg_Tiles.
Step 2: building optimized multi-scale image pyramids
2-1: the original picture sequence is subjected to multi-scale cutting, and the cutting proportion is set as S, if S=0.025, each picture is uniformly cut into a picture block sequence with the side length being 0.025 times of the side length of the original picture (zero filling is carried out on the edge), and the cutting proportion is set as S min =0.05 to S max 7 proportions are uniformly constructed between 0.5, so that a multi-scale picture pyramid of 7 layers is obtained, and each layer contains tiles uniformly divided by different proportions.
2-2: traversing the blocks of each layer, finding the feature in the seg_imgs where the block center point is located, and comparing the seg_tile scale of the feature blocks divided by the feature blocks with the feature blocksWhether the scale of the front layer is matched or not, and the scale of the front layer is set as S i Belonging to S min To S max Between the upper layer and the lower layer, the scale is S i-1 And S is i+1 The feature block scale is S x If and only if S i−1 ∗1.1 < S x ≤ S i+1 ∗ 1.1.1 or S x ≥ S max At the moment, the feature Image blocks are used for replacing uniformly divided Image blocks in the original Pyramid, and the optimized multi-scale Image Pyramid Multiscale construction is completed after the matching and replacing operation is carried out on all the Image blocks.
Step 3: language-image coding neural radiation field training and prediction
3-1: and (3) inputting the reconstructed Multiscale Image Pyramid Multiscale Image-Pyramid into a CLIP or Openclip contrast language-Image pre-training model, and coding the Image to a space consistent with the text to obtain a Multiscale Feature coding Pyramid Multiscale Feature-Pyramid.
3-2: the Multiscale Feature encoding Pyramid Multiscale feature_feature_pyramid is used as a reference value (GT), and a NeRF three-dimensional reconstruction model is utilized to train a language-image encoding nerve radiation Field CLIP_Field in an increasing mode.
3-3: when the query is carried out after the training is finished, the query text is put into the same contrast language-image pre-training model to obtain text codes, the code vectors and the predicted value of the language-image coding neural radiation Field CLIP_Field under the query view are subjected to dot products to obtain vector similarity, and therefore a group of points with the highest similarity are points of objects corresponding to the text query in the scene.
Compared with the prior art, the invention has the following beneficial technical effects and notable technical advances:
1) For the segmented result, the invention uses only the continuous semantic information of the image rather than segmentation-label information, which effectively ensures that semantic queries on the result are not limited to segmentation labels.
2) In the optimized multi-scale image pyramid constructed by the invention, the tiles of each level are partially rebuilt from matching feature blocks, so tile scales within a level are not fully uniform but adapted to object scale, achieving accurate attention to scene objects of different sizes.
3) The invention uses a conventional three-dimensional reconstruction model to train a feature-encoding radiance field from the multi-scale feature-encoding pyramid produced by the contrastive language-image pre-training model, finally obtaining pixel-level semantic-similarity information for the scene on the radiance field.
4) The method solves the problems that many three-dimensional reconstruction methods cannot obtain scene semantic information, obtain only single and fuzzy semantic information, or cannot perform large-scale semantic interaction and abstract semantic understanding in real time.
5) The method works well, is simple to implement, has clear modules that are convenient to upgrade and optimize, is easy to deploy (it can run directly on some laptops), and can be applied effectively to intelligent environment perception in fields such as robotics and autonomous driving.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow chart for reconstructing a multi-scale image pyramid;
FIG. 3 is a flow chart of model prediction.
Description of the embodiments
The present invention will be described in detail below with reference to the accompanying drawings and examples for the purpose of facilitating understanding of the present invention.
Examples
Referring to fig. 1, semantic three-dimensional reconstruction based on a contrastive language-image pre-training model and a multi-scale feature pyramid proceeds as follows:
s100: and inputting the input RGB picture sequence meeting the reconstruction standard into a panoramic segmentation model for point label prediction to obtain the characteristic label information of all the pixel points.
S110: and carrying out multi-scale uniform cutting and edge zero filling on the input image to obtain a basic multi-scale image pyramid.
S120: and (3) utilizing the label information of the image pixel points in the S100 to perform cutting, zero filling and center alignment on the continuous image semantic blocks of the same label.
S130: and (3) reconstructing an optimized multi-scale image pyramid by combining the results of the step S110 and the step S120, and inputting a contrast language-image pre-training model to obtain the multi-scale feature coding pyramid.
Referring to fig. 2, an optimized multi-scale image pyramid is constructed as follows:
s200: traversing the image blocks of each layer, finding out the characteristics in the Seg_Imgs of the center point of the image block, comparing whether the Seg_Tile scale of the characteristic image block divided by the characteristics is matched with the scale of the current layer, and setting the scale of the current layer as S i Belonging to S min To S max Between the upper layer and the lower layer, the scale is S i-1 And S is i+1 The feature block scale is S x If and only if S i−1 ∗1.1 < S x ≤ S i+1 ∗ 1.1.1 or S x ≥ S max The matching is called as matching, at the moment, the characteristic image blocks are used for replacing the uniformly divided image blocks in the original pyramid, otherwise, the image blocks are not replaced.
S210: and after matching and replacing all the tiles, the optimized multi-scale Image Pyramid Multiscale Image-Pyramid construction is completed.
S220: putting the multi-scale image pyramid into a CLIP or OpenCLIP contrast language-image pre-training model, coding the image to a space consistent with the text to obtain a multi-scale feature coding pyramid, wherein the pyramid is used as a reference value of semantic content for three-dimensional reconstruction training in the step S140.
Referring to FIG. 3, model prediction of the present invention (the semantic query part of FIG. 1) includes the following steps:
s300: putting the text of the query into the same CLIP or OpenCLIP contrast language-image pre-training model, and encoding the text into a vector space consistent with the model output.
S310: and for each point, performing dot product on the text coding vector and the image coding vector output by the model to obtain the pixel degree of each point and the query text.
S320: and outputting a similarity graph with aligned points as a model prediction query result, wherein the graph can be visualized by superposing a color table after normalization.
The present invention is not limited to the above embodiments; variations and improvements that those skilled in the art can conceive without departing from the spirit and scope of the inventive concept are included in the present invention, whose protection scope is defined by the appended claims.

Claims (4)

1. A semantic three-dimensional reconstruction method for a multi-scale feature pyramid, characterized by comprising the following steps:
step 1: two-dimensional semantic feature extraction
1-1: semantic segmentation is carried out on the input RGB picture sequence by using a two-dimensional panoramic segmentation model, and a result picture sequence seg_Imgs is obtained;
1-2: according to picture parameters in each result picture sequence seg_imgs, rectangular cutting all continuous features of the original picture, cutting from top, left to bottom, right of the features, and filling length and width zero into max|top-
The original features are moved to the center of the filling diagram to obtain a square feature block set seg_tiles;
step 2: building optimized multi-scale image pyramids
2-1: dividing the original image sequence into multiple scales, setting the dividing ratio as S, uniformly dividing the side length of each image into image block sequences with the side length of the original image being S times, and dividing the dividing ratio from S min =0.05 to S max Uniformly constructing 7 proportions among 0.5 to obtain a 7-layer multi-scale picture pyramid, wherein each layer contains tiles uniformly divided by different proportions;
2-2: traversing the image blocks of each layer, finding out the characteristics in the Seg_Imgs of the center point of the image block, comparing whether the Seg_Tile scale of the characteristic image block divided by the characteristics is matched with the scale of the current layer, and setting the scale of the current layer as S i Belonging to S min To S max Between the upper layer and the lower layer, the scale is S i-1 And S is i+1 The feature block scale is S x When S i−1 ∗1.1 < S x ≤ S i+1 ∗ 1.1.1, or S x ≥ S max When the image pyramid is called matching, the feature image blocks are used for replacing the divided image blocks in the original pyramid, and matching and replacing operations are carried out on all the image blocks to complete the optimized multi-scale image pyramidMultiscale Image Pyramid
Constructing;
step 3: language-image coding neural radiation field training and prediction
3-1: putting the reconstructed Multiscale Image Pyramid into a CLIP or Openclip contrast language-Image pre-training model to enable the coded Image and the text to be consistent in space, and obtaining a Multiscale Feature coded Pyramid Multiscale Feature Pyramid;
3-2: training one language-image coding nerve radiation Field CLIP_Field by using a NeRF three-dimensional reconstruction model by taking a Multiscale Feature coding Pyramid multiscale_feature_pyramid as a reference value GT;
3-3: after training is completed, the query can be carried out, the query text is put into the same contrast language-image pre-training model to obtain text codes, the code vectors and the language-image code neural radiation Field CLIP_Field predicted values under the query view are subjected to dot product according to the dots to obtain vector similarity, and a group of dots with the highest vector similarity are used as the dots of the object corresponding to the text query in the scene.
2. The semantic three-dimensional reconstruction method for a multi-scale feature pyramid according to claim 1, characterized in that in step 1-1, two-dimensional features are extracted with a two-dimensional panoptic segmentation model, and the feature blocks are processed by semantic segmentation to obtain continuous semantic information of the image.
3. The semantic three-dimensional reconstruction method according to claim 1, characterized in that in step 2-2, when constructing the optimized multi-scale image pyramid, the tiles of each level are partially rebuilt with the matching feature blocks, so that tile scales within a level are not fully uniform but adapted to object scale, achieving accurate attention to scene objects of different sizes.
4. The semantic three-dimensional reconstruction method according to claim 1, characterized in that in step 3-2, the multi-scale feature-encoding pyramid obtained by encoding with the contrastive language-image pre-training model is used to train a feature-encoding radiance field with a NeRF three-dimensional reconstruction model, obtaining pixel-level semantic-similarity information for the scene on the radiance field.
CN202311448872.XA 2023-11-02 2023-11-02 Semantic-containing three-dimensional reconstruction method for multi-scale feature pyramid Pending CN117274351A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311448872.XA CN117274351A (en) 2023-11-02 2023-11-02 Semantic-containing three-dimensional reconstruction method for multi-scale feature pyramid

Publications (1)

Publication Number Publication Date
CN117274351A 2023-12-22

Family

ID=89208204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311448872.XA Pending CN117274351A (en) 2023-11-02 2023-11-02 Semantic-containing three-dimensional reconstruction method for multi-scale feature pyramid

Country Status (1)

Country Link
CN (1) CN117274351A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325534A (en) * 2018-09-22 2019-02-12 天津大学 A kind of semantic segmentation method based on two-way multi-Scale Pyramid
WO2019127102A1 (en) * 2017-12-27 2019-07-04 深圳前海达闼云端智能科技有限公司 Information processing method and apparatus, cloud processing device, and computer program product
CN113345082A (en) * 2021-06-24 2021-09-03 云南大学 Characteristic pyramid multi-view three-dimensional reconstruction method and system
WO2022016311A1 (en) * 2020-07-20 2022-01-27 深圳元戎启行科技有限公司 Point cloud-based three-dimensional reconstruction method and apparatus, and computer device
CN114693930A (en) * 2022-03-31 2022-07-01 福州大学 Example segmentation method and system based on multi-scale features and context attention
CN115393410A (en) * 2022-07-18 2022-11-25 华东师范大学 Monocular view depth estimation method based on nerve radiation field and semantic segmentation
CN115775316A (en) * 2022-11-23 2023-03-10 长春理工大学 Image semantic segmentation method based on multi-scale attention mechanism
CN116310098A (en) * 2023-03-01 2023-06-23 电子科技大学 Multi-view three-dimensional reconstruction method based on attention mechanism and variable convolution depth network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Multi-object monocular SLAM based on instance segmentation and three-dimensional reconstruction in dynamic scenes", Chinese Journal of Scientific Instrument, vol. 44, no. 8, 15 August 2023 (2023-08-15) *
Cheng Mingyang; Gai Shaoyan; Da Feipeng: "Research on stereo matching networks based on attention mechanism", Acta Optica Sinica, no. 14, 27 May 2020 (2020-05-27) *

Similar Documents

Publication Publication Date Title
Li et al. Megadepth: Learning single-view depth prediction from internet photos
Kumar et al. Colorization transformer
CN109190752B (en) Image semantic segmentation method based on global features and local features of deep learning
CN111259936B (en) Image semantic segmentation method and system based on single pixel annotation
CN114255238A (en) Three-dimensional point cloud scene segmentation method and system fusing image features
CN110796143A (en) Scene text recognition method based on man-machine cooperation
CN110827312B (en) Learning method based on cooperative visual attention neural network
CN114943876A (en) Cloud and cloud shadow detection method and device for multi-level semantic fusion and storage medium
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN112070137B (en) Training data set generation method, target object detection method and related equipment
CN112084859A (en) Building segmentation method based on dense boundary block and attention mechanism
CN114708297A (en) Video target tracking method and device
CN112700476A (en) Infrared ship video tracking method based on convolutional neural network
CN117274883A (en) Target tracking method and system based on multi-head attention optimization feature fusion network
Blomqvist et al. Baking in the feature: Accelerating volumetric segmentation by rendering feature maps
Pei MSFNet: Multi-scale features network for monocular depth estimation
CN109886996B (en) Visual tracking optimization method
CN117274351A (en) Semantic-containing three-dimensional reconstruction method for multi-scale feature pyramid
CN114022371B (en) Defogging device and defogging method based on space and channel attention residual error network
Si et al. Image semantic segmentation based on improved DeepLab V3 model
CN117994525A (en) Point cloud semi-supervised panorama segmentation method based on mixed enhancement and instance information learning
LAIa et al. Immovable Cultural Relics Preservation Through 3D Reconstruction Using NeRF
He et al. Adaptive Voxelization Strategy for 3D Object Detection
CN117557998A (en) Change detection labeling method for remote sensing image
CN115761438A (en) Depth estimation-based saliency target detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination