CN117953224A - Open vocabulary 3D panorama segmentation method and system - Google Patents

Open vocabulary 3D panorama segmentation method and system

Info

Publication number
CN117953224A
Authority
CN
China
Prior art keywords
mask
segmentation
segmentation mask
model
point cloud
Prior art date
Legal status
Granted
Application number
CN202410357997.XA
Other languages
Chinese (zh)
Other versions
CN117953224B (en)
Inventor
严考碧
张鹏飞
苏江
Current Assignee
DMAI Guangzhou Co Ltd
Original Assignee
DMAI Guangzhou Co Ltd
Priority date: 2024-03-27
Filing date: 2024-03-27
Publication date: 2024-04-30
Application filed by DMAI Guangzhou Co Ltd
Priority to CN202410357997.XA
Publication of CN117953224A
Application granted
Publication of CN117953224B
Legal status: Active


Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an open vocabulary 3D panorama segmentation method and system, belonging to the technical field of computer vision and image processing. The method comprises: acquiring 3D point cloud data of a target scene; inputting the 3D point cloud data into a 3D segmentation network to obtain a 3D segmentation mask, the 3D segmentation mask comprising a 3D instance segmentation mask and a 3D semantic segmentation mask; mapping the 3D segmentation mask to a 2D mask space to obtain a corresponding 2D segmentation mask, the 2D segmentation mask comprising a 2D instance segmentation mask and a 2D semantic segmentation mask; associating the 2D segmentation mask with a text label using a CLIP model; and associating the text label with the 3D segmentation mask according to the correspondence between the 2D segmentation mask and the 3D segmentation mask. The invention requires no manual annotation of 3D data: the existing abundant 2D data and 2D segmentation models are used to generate pseudo 3D labels that guide the 3D model to output segmentation results, avoiding complex and expensive manual 3D segmentation annotation.

Description

Open vocabulary 3D panorama segmentation method and system
Technical Field
The invention relates to the technical field of computer vision and image processing, in particular to an open vocabulary 3D panorama segmentation method and system.
Background
Currently, in the field of computer vision, image segmentation methods mainly fall into three categories: 3D segmentation methods based on deep learning, open vocabulary 2D segmentation methods, and open vocabulary 3D segmentation methods. 3D segmentation methods based on deep learning generally require manually annotated 3D segmentation data for model training, and the model can only segment the categories seen in the training set. Open vocabulary 2D segmentation methods expand the segmentation capability of the model and can segment unseen categories, but they can only process 2D image data; the segmentation result provides only 2D information and cannot completely and accurately give the physical information of the segmented object in the real world. Open vocabulary 3D segmentation methods can segment object categories not seen in the training set, and at the same time the segmentation result is closer to the real physical world, so that more comprehensive information such as the physical shape, physical size and spatial position of objects can be obtained, improving the perception of objects and real-world scenes and providing accurate and comprehensive three-dimensional information for downstream tasks. Therefore, open vocabulary 3D segmentation methods are widely researched and applied.
In the prior art, the published paper "Open-Vocabulary Point-Cloud Object Detection without 3D Annotation" proposes an open vocabulary 3D object detection method, but the method requires manually annotated 3D segmentation masks to train a 3D segmentation model; 3D segmentation data annotation is complex, labor cost is high, and open-source datasets are few.
Therefore, how to overcome the problems of scarce open vocabulary 3D panorama segmentation data and difficult annotation is a problem that needs to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides an open vocabulary 3D panorama segmentation method and system.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
on the one hand, the invention discloses an open vocabulary 3D panorama segmentation method, which comprises the following steps:
Acquiring 3D point cloud data of a target scene;
Inputting the 3D point cloud data into a 3D segmentation network, and acquiring a 3D segmentation mask, wherein the 3D segmentation mask comprises a 3D instance segmentation mask and a 3D semantic segmentation mask;
mapping the 3D segmentation mask to a 2D mask space to obtain a corresponding 2D segmentation mask, wherein the 2D segmentation mask comprises a 2D instance segmentation mask and a 2D semantic segmentation mask;
associating the 2D segmentation mask with a text label using the CLIP model;
the text labels are associated to the 3D segmentation mask according to the correspondence of the 2D segmentation mask and the 3D segmentation mask.
Further, the step of acquiring 3D point cloud data of the target scene specifically includes the following steps:
acquiring RGBD images of a target scene by using a depth camera;
and carrying out reconstruction calculation on the RGBD image by utilizing the internal parameters of the depth camera to obtain 3D point cloud data of the target scene, wherein the specific calculation formulas are as follows:
x = (u − cx) · depth / fx
y = (v − cy) · depth / fy
z = depth
wherein x, y and z represent the coordinates of the 3D point cloud data; u, v denote the pixel coordinates of the RGBD image; depth represents the depth value of the RGBD image; fx, fy, cx, cy denote RGB-D camera intrinsic parameters, where cx, cy denote the offsets of the RGB-D camera optical axis in the image coordinate system; fx = f/dx, fy = f/dy; f denotes the focal length of the RGB-D camera, and dx, dy denote the physical lengths of one pixel in the x-axis and y-axis directions of the image coordinate system, respectively.
Further, the 3D segmentation network includes a 3D instance segmentation model and a 3D semantic segmentation model; the 3D instance segmentation model comprises, for example, a Mask3D model, and the 3D semantic segmentation model comprises, for example, a SEGCloud model or a LabelMaker model.
Further, the 3D segmentation mask is mapped to the 2D mask space to obtain the corresponding 2D segmentation mask by the following formulas:
u_mask = fx · x_mask / z_mask + cx
v_mask = fy · y_mask / z_mask + cy
wherein u_mask and v_mask represent the 2D coordinates corresponding to the 3D segmentation mask; x_mask, y_mask and z_mask represent the coordinates of the 3D segmentation mask; fx, fy, cx, cy denote RGB-D camera intrinsic parameters, where cx, cy denote the offsets of the RGB-D camera optical axis in the image coordinate system; fx = f/dx, fy = f/dy; f denotes the focal length of the RGB-D camera, and dx, dy denote the physical lengths of one pixel in the x-axis and y-axis directions of the image coordinate system, respectively.
Further, the 2D segmentation mask is associated with a text label by utilizing the CLIP model, and the method specifically comprises the following steps of:
Cutting out an image corresponding to the segmentation mask according to the 2D segmentation mask;
inputting the obtained corresponding image into an image encoder in the CLIP model to obtain a target 2D mask characteristic;
obtaining text characteristics corresponding to all words in the vocabulary by using a text encoder in the CLIP model;
and matching the target 2D mask features with the text features corresponding to all words to obtain the text label of the target 2D mask features.
Further, the method further comprises the steps of:
And calculating category-agnostic 3D mask loss by using the pseudo 3D segmentation mask, and performing supervision verification on the 3D segmentation mask acquired through the 3D point cloud data.
Further, the pseudo 3D segmentation mask is generated by the following steps:
inputting the RGB image of the target scene into a 2D segmentation pre-training model to obtain a 2D segmentation mask;
And mapping the obtained 2D segmentation mask to a 3D space by using camera parameters, and obtaining a pseudo 3D segmentation mask.
Further, the category-agnostic 3D mask loss is calculated using the pseudo 3D segmentation mask according to the following formula:
L_dice = 1 − 2|X ∩ Y| / (|X| + |Y|)
where L_dice represents the category-agnostic 3D mask loss, Y represents the target 3D mask corresponding to the pseudo 3D segmentation mask, and X represents the 3D segmentation mask obtained from the 3D point cloud data through the 3D segmentation network.
On the other hand, the invention also discloses an open vocabulary 3D panorama segmentation system, which comprises: a 3D mask generation module and a mask tag alignment module;
the 3D mask generation module is used for converting 3D point cloud data of a target scene into a 3D segmentation mask through a 3D segmentation network, wherein the 3D segmentation mask comprises a 3D instance segmentation mask and a 3D semantic segmentation mask;
the mask tag alignment module performs the following operations:
Mapping the 3D segmentation mask to a 2D mask space to obtain a corresponding 2D segmentation mask, wherein the 2D segmentation mask comprises a 2D instance segmentation mask and a 2D semantic segmentation mask;
associating the 2D segmentation mask with a text label using the CLIP model;
the text labels are associated to the 3D segmentation mask according to the correspondence of the 2D segmentation mask and the 3D segmentation mask.
Preferably, the system further comprises a pseudo 3D segmentation mask supervision and verification module, which is used for calculating a category-agnostic 3D mask loss using the pseudo 3D segmentation mask and performing supervision verification on the 3D segmentation mask acquired through the 3D point cloud data.
Compared with the prior art, the invention discloses the open vocabulary 3D panorama segmentation method and system, which have the following beneficial effects:
1) Compared with open vocabulary 2D panorama segmentation methods, the open vocabulary 3D panorama segmentation adopted by the invention produces results closer to the real physical world, so that more comprehensive information such as the physical shape, physical size and spatial position of an object can be obtained, improving the perception of objects and real-world scenes.
2) The invention requires no manual annotation of 3D data: the existing abundant 2D data and state-of-the-art 2D segmentation models are used to generate pseudo 3D labels that guide the 3D model to output segmentation results, avoiding complex and expensive manual 3D segmentation annotation.
3) Compared with traditional 3D segmentation methods and 3D segmentation methods based on deep learning, the invention expands the segmentation capability of the 3D model for point clouds and enhances the generalization capability of the model: the segmentation categories can be extended through a vocabulary, realizing open vocabulary 3D panorama segmentation and accurate segmentation of the point cloud.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of an open vocabulary 3D panorama segmentation method provided by the present invention.
Fig. 2 is a block diagram of an open vocabulary 3D panorama segmentation system according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
As shown in fig. 1, the embodiment of the invention discloses an open vocabulary 3D panorama segmentation method, which comprises the following steps:
Acquiring 3D point cloud data of a target scene;
inputting the 3D point cloud data into a 3D segmentation network, and acquiring a 3D segmentation mask, wherein the 3D segmentation mask comprises a 3D instance segmentation mask and a 3D semantic segmentation mask;
mapping the obtained 3D segmentation mask to a 2D mask space to obtain a corresponding 2D segmentation mask, wherein the 2D segmentation mask comprises a 2D instance segmentation mask and a 2D semantic segmentation mask;
associating the obtained 2D segmentation mask with a text label by using a CLIP model;
the text labels are associated to the 3D segmentation mask according to the correspondence of the 2D segmentation mask and the 3D segmentation mask.
In a specific embodiment, the text labels are associated with the obtained 2D segmentation masks by using a CLIP model, and specifically comprises the following steps:
Cutting out an image corresponding to the segmentation mask according to the 2D segmentation mask; inputting the obtained corresponding image into an image encoder in the CLIP model to obtain a target 2D mask characteristic; obtaining text characteristics corresponding to all words in the vocabulary by using a text encoder in the CLIP model; and matching the target 2D mask features with text features corresponding to all words to obtain text labels of the target 2D mask features.
In the embodiment of the invention, the 3D segmentation mask is obtained through a 3D segmentation network according to the 3D point cloud data of the target scene. Specifically, a 3D instance segmentation mask is obtained from the 3D point cloud data through a 3D instance segmentation model, and a 3D semantic segmentation mask is obtained from the 3D point cloud data through a 3D semantic segmentation model; the obtained 3D segmentation mask is then mapped to 2D mask space to obtain the corresponding 2D segmentation mask.
In the present invention, a vocabulary is understood to mean a set of text labels with semantics; for example, {cat, dog, table} is a vocabulary. The text labels with semantics in the vocabulary are input into the text encoder of the CLIP model to obtain text features; the corresponding target region is cropped from the RGB image according to the 2D instance/semantic mask and input into the image encoder of the CLIP model to obtain the corresponding 2D mask features (2D visual features). The 2D instance/semantic segmentation mask is then mapped to a text label in the vocabulary by comparing the 2D mask features of the segmentation target with the features of the text labels in the vocabulary, and the text label is associated with the 3D segmentation mask according to the correspondence between the 2D segmentation mask and the 3D segmentation mask.
Still further the method further comprises: and calculating category-agnostic 3D mask loss by using the pseudo 3D segmentation mask, and performing supervision verification on the 3D segmentation mask acquired through the 3D point cloud data.
The process of the present invention is described in detail below by way of specific use examples:
1. Data acquisition: in the implementation of the invention, an RGBD image of the target scene is acquired using an RGB-D camera, and the 3D point cloud corresponding to the RGBD image is reconstructed from the internal parameters of the camera, calculated as follows:
x = (u − cx) · depth / fx (1)
y = (v − cy) · depth / fy (2)
z = depth (3)
Wherein x, y, z represent coordinates of the 3D point cloud data; u, v denote pixel coordinates of the RGBD image; depth represents the depth value of the RGBD image; fx, fy, cx, cy denotes RGB-D camera parameters, where cx, cy denote offset values of the RGB-D camera optical axis in the image coordinate system; fx=f/dx, fy=f/dy; f denotes a focal length of the RGB-D camera, dx, dy denote physical lengths of one pixel in x-axis and y-axis directions of the image coordinate system, respectively.
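By way of illustration, formulas (1)–(3) can be implemented directly; the following is a minimal sketch of the back-projection (function and variable names are illustrative, not part of the invention):

```python
import numpy as np

def rgbd_to_pointcloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H x W, metric units) to a 3D point cloud
    using formulas (1)-(3): x=(u-cx)*depth/fx, y=(v-cy)*depth/fy, z=depth."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates (u, v)
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop pixels with no valid depth

# usage: points = rgbd_to_pointcloud(depth_image, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
```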
2. Model construction and data processing:
The 3D point cloud data are input into a 3D segmentation network to obtain a 3D segmentation mask. The 3D segmentation network in this process may specifically comprise a 3D instance segmentation model and a 3D semantic segmentation model.
A 3D instance segmentation model such as Mask3D can realize 3D instance segmentation (3D Thing segmentation): the 3D point cloud is input into the 3D instance segmentation model, which outputs a 3D instance segmentation mask (3D Thing Mask). The corresponding features can be obtained from the network structure according to the mask: a minimum bounding box containing the mask is obtained from the 3D instance segmentation mask, and the features corresponding to the minimum bounding box are then extracted from a feature layer of the network structure via the ROI.
A 3D semantic segmentation model such as SEGCloud or LabelMaker can realize 3D semantic segmentation (3D Stuff segmentation): the 3D point cloud is input into the 3D semantic segmentation model, which outputs a 3D semantic segmentation mask (3D Stuff Mask). The corresponding features can be obtained from the network structure according to the 3D semantic segmentation mask: a minimum bounding box containing the 3D Stuff Mask is obtained, and the features corresponding to the minimum bounding box are then extracted from a feature layer of the network structure via the ROI.
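For illustration only, the minimum bounding box of a mask can be computed from the masked points; the sketch below assumes an axis-aligned box and a boolean mask over the point cloud (an oriented box would require an additional fitting step):

```python
import numpy as np

def mask_bounding_box(points, mask):
    """Minimum axis-aligned bounding box containing the masked points.
    points: (N, 3) point cloud; mask: (N,) boolean 3D segmentation mask."""
    masked = points[mask]
    return masked.min(axis=0), masked.max(axis=0)  # (xyz_min, xyz_max)
```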
The 3D point cloud data are mapped to 2D space through the segmentation masks of the 3D segmentation network to obtain 2D segmentation masks, and the mapping relation between the two is recorded. The corresponding image is cropped according to the 2D segmentation mask and input into the image encoder of CLIP to obtain the 2D segmentation mask features; the text encoder of CLIP is used to obtain the text features corresponding to the words in the vocabulary.
The contrast loss function of the 3D segmentation Mask feature (3D Mask Features) and the 2D segmentation Mask feature (2D Mask Features) can be constructed according to the corresponding relation between the 3D Mask and the 2D Mask.
According to the 2D segmentation mask features, the corresponding text features (Text Features) can be found through the CLIP model, so that a contrast loss function of the 3D segmentation mask features and the text features is constructed. The contrast loss function is calculated as follows:
L_contrast = −(1/N) Σ_{i=1..N} (1/M) Σ_{h_t ∈ P(i)} log( exp(sim(h_i, h_t)/τ) / Σ_{j≠i} exp(sim(h_i, h_j)/τ) )
where N is the number of samples, M is the number of positive samples of each anchor, P(i) is the set of positive samples of the i-th sample, h_i represents the i-th sample, sim(·,·) is a similarity measure such as cosine similarity, and τ is a temperature coefficient. When calculating the contrast loss function of the 3D mask target features and the 2D mask features, h_i represents a 3D mask target feature and h_t represents the 2D mask feature corresponding to h_i (after mapping, each 3D mask has a corresponding 2D mask), which is a positive sample of h_i; 2D mask features not corresponding to h_i are negative samples, and h_j may be either a positive or a negative sample of h_i. When calculating the contrast loss function of the 3D mask target features and the text label features, h_i represents a 3D mask target feature and h_t represents the text label feature corresponding to h_i, which is its positive sample; text label features not corresponding to h_i are negative samples, and h_j may likewise be either a positive or a negative sample of h_i.
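A minimal PyTorch sketch of such a contrast loss is given below; it assumes cosine similarity, a temperature τ, and one positive per anchor (the special case M = 1 of the formula above), which are illustrative choices rather than the patent's exact formulation:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(h3d, h2d, tau=0.07):
    """InfoNCE-style contrast loss between N 3D mask features and their
    corresponding 2D mask (or text label) features.
    h3d, h2d: (N, D) tensors; row i of h2d is the positive sample h_t of
    h3d[i], and all other rows act as negatives h_j."""
    h3d = F.normalize(h3d, dim=-1)
    h2d = F.normalize(h2d, dim=-1)
    logits = h3d @ h2d.t() / tau              # (N, N) cosine similarities / tau
    targets = torch.arange(h3d.size(0), device=h3d.device)
    return F.cross_entropy(logits, targets)   # -log softmax of each positive pair
```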
The correspondence to text labels is realized using a CLIP model. In the embodiment of the invention, alignment between images and texts can be realized using the CLIP model, which comprises an image encoder for encoding image features and a text encoder for encoding text features.
The 3D point cloud data may generate a 3D segmentation mask (including a 3D instance segmentation mask and a 3D semantic segmentation mask) via a 3D segmentation network, the 3D segmentation mask being mapped to a 2D space according to camera internal parameters, the calculation formula being as follows:
u_mask = fx · x_mask / z_mask + cx (4)
v_mask = fy · y_mask / z_mask + cy (5)
wherein u_mask and v_mask represent the 2D coordinates corresponding to the 3D segmentation mask; x_mask, y_mask and z_mask represent the coordinates of the 3D segmentation mask; fx, fy, cx, cy denote RGB-D camera intrinsic parameters, where cx, cy denote the offsets of the RGB-D camera optical axis in the image coordinate system; fx = f/dx, fy = f/dy; f denotes the focal length of the RGB-D camera, and dx, dy denote the physical lengths of one pixel in the x-axis and y-axis directions of the image coordinate system, respectively.
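As an illustration, formulas (4)–(5) amount to a standard pinhole projection of the masked 3D points; a minimal sketch (names are illustrative):

```python
import numpy as np

def project_mask_to_2d(mask_points, fx, fy, cx, cy):
    """Project the 3D points of a segmentation mask to 2D pixel coordinates
    using formulas (4)-(5): u = fx*x/z + cx, v = fy*y/z + cy.
    mask_points: (N, 3) array of (x_mask, y_mask, z_mask)."""
    x, y, z = mask_points[:, 0], mask_points[:, 1], mask_points[:, 2]
    u = fx * x / z + cx
    v = fy * y / z + cy
    return np.stack([u, v], axis=-1).round().astype(int)  # integer pixel coords
```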
After the 3D segmentation mask is mapped to 2D space, the corresponding image region is cropped according to the 2D segmentation mask and input into the image encoder of the CLIP model to output image features, and the vocabulary is input into the text encoder of the CLIP model to extract text features. The CLIP model finds the text label corresponding to the 2D segmentation mask using the extracted text label features and the 2D segmentation mask features, and the text label is associated with the 3D segmentation mask according to the correspondence between the 2D segmentation mask and the 3D segmentation mask, thereby realizing the alignment of the 3D segmentation mask features with the text label features and the corresponding 2D mask features.
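For illustration, this matching step can be sketched with the open-source OpenAI CLIP package (the package choice and the example vocabulary are assumptions made for the sketch; any CLIP implementation with an image encoder and a text encoder works):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

vocabulary = ["sofa", "table", "chair", "floor", "wall"]  # example vocabulary

def label_mask_crop(crop: Image.Image) -> str:
    """Assign the most similar vocabulary word to a cropped 2D mask region."""
    image = preprocess(crop).unsqueeze(0).to(device)
    text = clip.tokenize(vocabulary).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)   # target 2D mask feature
        txt_feat = model.encode_text(text)     # text features of all words
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        scores = (img_feat @ txt_feat.t()).squeeze(0)  # cosine similarities
    return vocabulary[scores.argmax().item()]  # highest-scoring text label
```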
In an improved embodiment, the method further comprises: and calculating category-agnostic 3D mask loss by using the pseudo 3D segmentation mask, and performing supervision verification on the 3D segmentation mask acquired through the 3D point cloud data.
The pseudo 3D instance/semantic segmentation mask (Pseudo 3D Thing/Stuff Mask) is generated as follows.
Firstly, 2D mask segmentation is performed on the RGB image of the current scene by a 2D pre-trained segmentation model, which comprises a 2D instance segmentation model and a 2D semantic segmentation model.
Specifically, a 2D instance segmentation model such as Mask R-CNN, Cascade Mask R-CNN or HTC can realize 2D Thing segmentation; the RGB image is directly input into the instance segmentation model to obtain the segmentation mask of the 2D instance segmentation.
2D semantic segmentation can be realized by a 2D semantic segmentation model such as FCN, DeepLab or U-Net; the RGB image is directly input into the semantic segmentation model to obtain the segmentation mask of the 2D semantic segmentation. The coordinates corresponding to the 2D segmentation mask (2D Thing/Stuff Mask) are mapped according to formulas (1), (2) and (3) to obtain a 3D mask; the 3D mask is then clustered and screened (a specific clustering method is K-Means), and the points of the clustered mask with large errors under the coordinate mapping are removed to obtain the final pseudo 3D segmentation mask.
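A rough sketch of this pseudo mask generation is given below: the 2D mask is lifted to 3D via formulas (1)–(3) and then screened with K-Means. The keep-the-dominant-cluster rule is an illustrative assumption standing in for the patent's error-based screening:

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_3d_mask(mask2d, depth, fx, fy, cx, cy, n_clusters=2):
    """Lift a 2D segmentation mask to a pseudo 3D mask and screen outliers.
    mask2d: (H, W) boolean 2D Thing/Stuff mask; depth: (H, W) depth map."""
    v, u = np.nonzero(mask2d & (depth > 0))
    z = depth[v, u]
    # back-project the masked pixels with formulas (1)-(3)
    pts = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=-1)
    # cluster the lifted points and keep the dominant cluster, discarding
    # points with large mapping error (e.g. depth bleeding at mask boundaries)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(pts)
    keep = labels == np.bincount(labels).argmax()
    return pts[keep]
```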
The pseudo 3D segmentation mask and the 3D segmentation mask obtained from the 3D point cloud are used to calculate a category-agnostic 3D mask loss, which adopts a Dice loss with the following formula:
L_dice = 1 − 2|X ∩ Y| / (|X| + |Y|)
where L_dice represents the category-agnostic 3D mask loss, Y represents the target 3D mask corresponding to the pseudo 3D segmentation mask, and X represents the 3D segmentation mask obtained from the 3D point cloud data through the 3D segmentation network.
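A minimal sketch of this Dice loss, assuming X and Y are boolean masks over the same point set:

```python
import numpy as np

def dice_loss(x_mask, y_mask, eps=1e-6):
    """Category-agnostic 3D mask (Dice) loss: 1 - 2|X∩Y| / (|X|+|Y|).
    x_mask: predicted 3D segmentation mask; y_mask: pseudo 3D target mask;
    both boolean arrays of the same shape."""
    inter = np.logical_and(x_mask, y_mask).sum()
    return 1.0 - 2.0 * inter / (x_mask.sum() + y_mask.sum() + eps)
```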
According to the invention, the segmentation result of the 3D segmentation network is mapped to 2D space to obtain a 2D mask, and the mapping relation between the two is recorded. The corresponding image is cropped according to the 2D mask and input into the image encoder of CLIP to obtain the 2D Mask Features; the text encoder of CLIP is used to obtain the Text Features corresponding to the words in the vocabulary. A contrast loss function of the 3D Mask Features and the 2D Mask Features can be constructed according to the correspondence between the 3D mask and the 2D mask.
According to the 2D Mask Features, the corresponding Text Features can be found through the CLIP model, so that the contrast loss function of the 3D Mask Features and the Text Features is constructed in the same form as the formula given above: h_i represents the 3D mask target feature; h_t represents the 2D mask feature or text label feature corresponding to h_i, which is its positive sample; features not corresponding to h_i are negative samples; and h_j may be either a positive or a negative sample of h_i.
Example 2
According to the open vocabulary 3D panorama segmentation method of Embodiment 1, the invention also discloses an open vocabulary 3D panorama segmentation system, which is mainly realized by a computer system and its corresponding software modules and specifically comprises: a 3D mask generation module and a mask tag alignment module. The 3D mask generation module is used for converting the 3D point cloud data of the target scene into a 3D segmentation mask through a 3D segmentation network, wherein the 3D segmentation mask comprises a 3D instance segmentation mask and a 3D semantic segmentation mask. The mask tag alignment module performs the following operations: mapping the 3D segmentation mask to a 2D mask space to obtain a corresponding 2D segmentation mask, wherein the 2D segmentation mask comprises a 2D instance segmentation mask and a 2D semantic segmentation mask; associating the 2D segmentation mask with a text label using the CLIP model; and associating the text label with the 3D segmentation mask according to the correspondence between the 2D segmentation mask and the 3D segmentation mask.
As a preferred embodiment, the system further comprises a pseudo 3D segmentation mask supervision and verification module, wherein the pseudo 3D segmentation mask supervision and verification module is configured to calculate a category-agnostic 3D mask loss by using the pseudo 3D segmentation mask, and perform supervision and verification on the 3D segmentation mask acquired through the 3D point cloud data.
The overall system structure block diagram of the open vocabulary 3D panorama segmentation system is shown in fig. 2, and the construction of each module and the data processing process can refer to the open vocabulary 3D panorama segmentation method of embodiment 1.
The specific application and prediction process of the open vocabulary 3D panorama segmentation system according to the invention is briefly described below by way of a specific example.
Firstly, an RGBD image is input to obtain a 3D point cloud, then the 3D point cloud is input to a 3D mask generation module, and a corresponding 3D instance segmentation mask and a 3D semantic segmentation mask are output.
The obtained 3D segmentation masks (3D instance segmentation mask and 3D semantic segmentation mask) are mapped from 3D to 2D space according to formulas (4) and (5) to obtain the corresponding 2D segmentation masks, and the correspondence between the 3D segmentation masks and the 2D segmentation masks is recorded.
As a specific example, a 3D segmentation mask segmented by the model (assumed to be a sofa) is selected in this process and mapped to a 2D segmentation mask according to formulas (4) and (5), and the mapping relation between the 3D segmentation mask (the sofa) and the 2D segmentation mask is recorded.
The corresponding image is cropped according to the obtained 2D segmentation mask and input into the image encoder of the CLIP model to obtain image features; the words in the vocabulary are input into the text encoder of the CLIP model to obtain text features. The similarity between the image features and the text features is then compared, and the text corresponding to the text feature with the highest score is taken as the label of the image.
Specifically, the 2D segmentation mask (assumed to be a sofa) obtained in the previous step is used to crop the corresponding image; the cropped image is input into the image encoder of CLIP to obtain its image features, and all words in the vocabulary are input into the text encoder of CLIP to obtain text features. The features of the 2D segmentation mask are then compared with the features of all words in the vocabulary to find the most similar text ("sofa"), and this label ("sofa") is assigned to the 2D segmentation mask.
In the previous step, labels are assigned to the 2D segmentation masks by the CLIP model. According to the recorded correspondence between the 3D segmentation masks and the 2D segmentation masks, the 3D segmentation mask corresponding to each 2D segmentation mask can be found, and the text label assigned to a 2D segmentation mask is also the label of the corresponding 3D segmentation mask, thereby realizing open vocabulary 3D point cloud panorama segmentation. For example, if the label of the obtained 2D mask is "sofa", the 3D mask corresponding to that 2D mask is found according to the mapping relation between the 2D mask and the 3D mask, and the label of the found 3D mask is likewise "sofa".
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An open vocabulary 3D panorama segmentation method, characterized by comprising the following steps:
Acquiring 3D point cloud data of a target scene;
Inputting the 3D point cloud data into a 3D segmentation network, and acquiring a 3D segmentation mask, wherein the 3D segmentation mask comprises a 3D instance segmentation mask and a 3D semantic segmentation mask;
mapping the 3D segmentation mask to a 2D mask space to obtain a corresponding 2D segmentation mask, wherein the 2D segmentation mask comprises a 2D instance segmentation mask and a 2D semantic segmentation mask;
associating the 2D segmentation mask with a text label using the CLIP model;
the text labels are associated to the 3D segmentation mask according to the correspondence of the 2D segmentation mask and the 3D segmentation mask.
2. The method for 3D panorama segmentation of an open vocabulary according to claim 1, wherein the obtaining of 3D point cloud data of a target scene comprises the steps of:
acquiring RGBD images of a target scene by using a depth camera;
and carrying out reconstruction calculation on the RGBD image by utilizing the internal parameters of the depth camera to obtain 3D point cloud data of the target scene, wherein the specific calculation formulas are as follows:
x = (u − cx) · depth / fx
y = (v − cy) · depth / fy
z = depth
wherein x, y and z represent the coordinates of the 3D point cloud data; u, v denote the pixel coordinates of the RGBD image; depth represents the depth value of the RGBD image; fx, fy, cx, cy denote RGB-D camera intrinsic parameters, where cx, cy denote the offsets of the RGB-D camera optical axis in the image coordinate system; fx = f/dx, fy = f/dy; f denotes the focal length of the RGB-D camera, and dx, dy denote the physical lengths of one pixel in the x-axis and y-axis directions of the image coordinate system, respectively.
3. The open vocabulary 3D panorama segmentation method according to claim 1, wherein the 3D segmentation network comprises a 3D instance segmentation model and a 3D semantic segmentation model; the 3D instance segmentation model comprises a Mask3D model, and the 3D semantic segmentation model comprises a SEGCloud model or a LabelMaker model.
4. The open vocabulary 3D panorama segmentation method according to claim 1, wherein mapping the 3D segmentation mask into a 2D mask space results in a corresponding 2D segmentation mask by the following formula:
u_mask = fx · x_mask / z_mask + cx; v_mask = fy · y_mask / z_mask + cy; wherein u_mask and v_mask represent the 2D coordinates corresponding to the 3D segmentation mask; x_mask, y_mask and z_mask represent the coordinates of the 3D segmentation mask; fx, fy, cx, cy denote RGB-D camera intrinsic parameters, where cx, cy denote the offsets of the RGB-D camera optical axis in the image coordinate system; fx = f/dx, fy = f/dy; f denotes the focal length of the RGB-D camera, and dx, dy denote the physical lengths of one pixel in the x-axis and y-axis directions of the image coordinate system, respectively.
5. The open vocabulary 3D panorama segmentation method according to claim 1, wherein the 2D segmentation mask is associated with a text label using a CLIP model, comprising the steps of:
Cutting out an image corresponding to the segmentation mask according to the 2D segmentation mask;
inputting the obtained corresponding image into an image encoder in the CLIP model to obtain a target 2D mask characteristic;
obtaining text characteristics corresponding to all words in the vocabulary by using a text encoder in the CLIP model;
and matching the target 2D mask features with the text features corresponding to all words to obtain the text label of the target 2D mask features.
6. The open vocabulary 3D panorama segmentation method according to claim 1, further comprising:
And calculating category-agnostic 3D mask loss by using the pseudo 3D segmentation mask, and performing supervision verification on the 3D segmentation mask acquired through the 3D point cloud data.
7. The open vocabulary 3D panorama segmentation method according to claim 6, wherein the pseudo 3D segmentation mask is obtained by:
inputting the RGB image of the target scene into a 2D segmentation pre-training model to obtain a 2D segmentation mask;
And mapping the obtained 2D segmentation mask to a 3D space by using camera parameters, and obtaining a pseudo 3D segmentation mask.
8. The open vocabulary 3D panorama segmentation method according to claim 6, wherein the class agnostic 3D mask loss is calculated using a pseudo 3D segmentation mask, comprising the following calculation formula:
L_dice = 1 − 2|X ∩ Y| / (|X| + |Y|), where L_dice represents the category-agnostic 3D mask loss, Y represents the target 3D mask corresponding to the pseudo 3D segmentation mask, and X represents the 3D segmentation mask obtained from the 3D point cloud data through the 3D segmentation network.
9. An open vocabulary 3D panorama segmentation system, comprising: a 3D mask generation module and a mask tag alignment module;
the 3D mask generation module is used for converting 3D point cloud data of a target scene into a 3D segmentation mask through a 3D segmentation network, wherein the 3D segmentation mask comprises a 3D instance segmentation mask and a 3D semantic segmentation mask;
the mask tag alignment module performs the following operations:
Mapping the 3D segmentation mask to a 2D mask space to obtain a corresponding 2D segmentation mask, wherein the 2D segmentation mask comprises a 2D instance segmentation mask and a 2D semantic segmentation mask;
associating the 2D segmentation mask with a text label using the CLIP model;
the text labels are associated to the 3D segmentation mask according to the correspondence of the 2D segmentation mask and the 3D segmentation mask.
10. The open vocabulary 3D panorama segmentation system according to claim 9, further comprising a pseudo 3D segmentation mask supervision and verification module, which is configured to calculate a category-agnostic 3D mask loss using the pseudo 3D segmentation mask and perform supervision verification on the 3D segmentation mask obtained from the 3D point cloud data.
CN202410357997.XA (filed 2024-03-27) — Open vocabulary 3D panorama segmentation method and system — Active — granted as CN117953224B (en)

Priority Applications (1)

Application Number: CN202410357997.XA — Priority/Filing date: 2024-03-27 — Title: Open vocabulary 3D panorama segmentation method and system — granted as CN117953224B (en)


Publications (2)

Publication Number — Publication Date
CN117953224A — 2024-04-30
CN117953224B — 2024-07-05


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112037279A (en) * 2020-09-04 2020-12-04 贝壳技术有限公司 Article position identification method and device, storage medium and electronic equipment
US20210150227A1 (en) * 2019-11-15 2021-05-20 Argo AI, LLC Geometry-aware instance segmentation in stereo image capture processes
US20210174529A1 (en) * 2019-12-06 2021-06-10 Mashgin Inc. System and method for identifying items
CN113269862A (en) * 2021-05-31 2021-08-17 中国科学院自动化研究所 Scene-adaptive fine three-dimensional face reconstruction method, system and electronic equipment
CN116935356A (en) * 2023-07-28 2023-10-24 中国科学技术大学 Weak supervision-based automatic driving multi-mode picture and point cloud instance segmentation method
CN117274388A (en) * 2023-10-17 2023-12-22 四川大学 Unsupervised three-dimensional visual positioning method and system based on visual text relation alignment
CN117671688A (en) * 2023-12-07 2024-03-08 北京智源人工智能研究院 Segmentation recognition and text description method and system based on hintable segmentation model
CN117745944A (en) * 2023-12-20 2024-03-22 北京百度网讯科技有限公司 Pre-training model determining method, device, equipment and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Junbo Zhang, Runpei Dong, Kaisheng Ma: "CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP", 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 6 October 2023, pages 2040–2047. *

Similar Documents

Publication — Title
Luo et al. Traffic sign recognition using a multi-task convolutional neural network
CN108376244B (en) Method for identifying text font in natural scene picture
CN107301414B (en) Chinese positioning, segmenting and identifying method in natural scene image
CN106846306A (en) A kind of ultrasonoscopy automatic describing method and system
CN111160352A (en) Workpiece metal surface character recognition method and system based on image segmentation
CN110796143A (en) Scene text recognition method based on man-machine cooperation
CN113378815B (en) Scene text positioning and identifying system and training and identifying method thereof
CN115424282A (en) Unstructured text table identification method and system
CN112037239B (en) Text guidance image segmentation method based on multi-level explicit relation selection
CN113408584A (en) RGB-D multi-modal feature fusion 3D target detection method
CN113269089A (en) Real-time gesture recognition method and system based on deep learning
CN111027456A (en) Mechanical water meter reading identification method based on image identification
CN117370498B (en) Unified modeling method for 3D open vocabulary detection and closed caption generation
CN110096987B (en) Dual-path 3DCNN model-based mute action recognition method
CN115188378A (en) Target recognition visual ranging method and system based on voice interaction
CN113743389A (en) Facial expression recognition method and device and electronic equipment
CN117274388A (en) Unsupervised three-dimensional visual positioning method and system based on visual text relation alignment
CN113223037A (en) Unsupervised semantic segmentation method and unsupervised semantic segmentation system for large-scale data
CN117953224B (en) Open vocabulary 3D panorama segmentation method and system
CN117953224A (en) Open vocabulary 3D panorama segmentation method and system
CN115471901A (en) Multi-pose face frontization method and system based on generation of confrontation network
CN112215285B (en) Cross-media-characteristic-based automatic fundus image labeling method
CN109657691B (en) Image semantic annotation method based on energy model
Das et al. Object Detection on Scene Images: A Novel Approach
Guan et al. Synthetic region screening and adaptive feature fusion for constructing a flexible object detection database

Legal Events

Code — Description
PB01 — Publication
SE01 — Entry into force of request for substantive examination
GR01 — Patent grant