CN117830601A - Three-dimensional visual positioning method, device, equipment and medium based on weak supervision - Google Patents

Three-dimensional visual positioning method, device, equipment and medium based on weak supervision Download PDF

Info

Publication number
CN117830601A
Authority
CN
China
Prior art keywords
text
features
point cloud
picture
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410239096.0A
Other languages
Chinese (zh)
Other versions
CN117830601B (en)
Inventor
王旭
许晓旭
张秋丹
刘学讯
江健民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN202410239096.0A priority Critical patent/CN117830601B/en
Publication of CN117830601A publication Critical patent/CN117830601A/en
Application granted granted Critical
Publication of CN117830601B publication Critical patent/CN117830601B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a three-dimensional visual positioning method, device, equipment and medium based on weak supervision. A 3D object detector of a pre-trained 3D classification model performs a 3D proposal box query for the input query text, generating 3D proposal box features and their corresponding three-dimensional residual features; a text classifier of the 3D classification model produces the query feature and category residual features of the input query text; the proposal box features and category residual features of each proposal box are matrix-multiplied to obtain category features; the cosine similarity between the three-dimensional residual features of different proposal boxes and the query feature is calculated; and the proposal box with the highest cosine similarity score, together with its category features, is taken as the query target. The method and device reduce the three-dimensional bounding box labeling work, improve the accuracy of three-dimensional visual positioning, and promote its application.

Description

Three-dimensional visual positioning method, device, equipment and medium based on weak supervision
Technical Field
The invention relates to the technical field of image processing, in particular to a three-dimensional visual positioning method, device, equipment and medium based on weak supervision.
Background
Three-dimensional visual positioning accurately locates target objects in a 3D scene according to a given natural language query. It has attracted considerable attention and developed rapidly in recent years. However, existing work mainly explores fully supervised solutions, which provide the three-dimensional bounding box for each text query during training and thereby help the model establish explicit alignment between the two modalities. Annotating dense three-dimensional bounding boxes in a point cloud is a time-consuming, labor-intensive and expensive task, which seriously hampers the application of three-dimensional visual positioning and affects its accuracy.
Disclosure of Invention
In order to solve these problems, the invention provides a three-dimensional visual positioning method, device, equipment and medium based on weak supervision, which reduce the three-dimensional bounding box labeling work, improve the accuracy of three-dimensional visual positioning and promote its application.
The embodiment of the invention provides a three-dimensional visual positioning method based on weak supervision, which comprises the following steps:
performing a 3D proposal box query for the input query text with a 3D object detector of a pre-trained 3D classification model, generating 3D proposal box features and their corresponding three-dimensional residual features;
acquiring the query feature and category residual features of the input query text with a text classifier of the 3D classification model;
performing matrix multiplication on the proposal box features and the category residual features of each proposal box to obtain category features;
calculating the cosine similarity between the three-dimensional residual features of different proposal boxes and the query feature;
and taking the proposal box with the highest cosine similarity score, together with its category features, as the query target.
Preferably, the 3D classification model training process specifically includes:
acquiring a training data set containing object point clouds, object pictures and descriptive texts;
extracting three-dimensional candidate target boxes from the object point cloud with a pre-trained 3D object detector;
inputting the description text and the object point cloud in the target box into a preset network;
projecting the object point cloud in the target box onto a 2D plane according to the camera intrinsic and extrinsic parameters, and determining the picture region containing the object point cloud;
extracting features from the description text, the picture region and the object point cloud in the target box with the text encoder, image encoder and 3D encoder of a preset CLIP encoder, obtaining text features, picture features and point cloud features respectively;
performing similarity calculation among the text features, the picture features and the point cloud features, and determining the class probabilities of the picture and the point cloud;
adding an adapter to each of the text encoder, the image encoder and the 3D encoder, aligning the three-dimensional features with the text features using the picture features as a bridge, and establishing a semantic relation between the text features and the point cloud features with residual connections;
and training and optimizing through a preset classification loss function, completing the 3D classification model training, and obtaining the 3D object detector and the text classifier.
Further, after obtaining the text features, the picture features and the point cloud features, the method further includes:
performing contrastive learning on the point cloud features and the picture features, and establishing a matching relationship between the picture features and the point cloud features.
As an improvement of the above solution, the contrastive learning on the point cloud features and the picture features to establish a matching relationship between them specifically includes:
implicitly aligning the point cloud features with the picture features according to the semantic alignment between the point cloud features and the picture features in the 3D encoder;
and optimizing the loss function of the 3D encoder, so that the point cloud features generated by the learned 3D encoder are aligned with the picture features.
Preferably, the loss function for aligning the point cloud features with the picture features is:

$$\mathcal{L}_{align}=-\frac{1}{2M}\sum_{i=1}^{M}\left[\log\frac{\exp\left(f_{i}^{2D}\cdot f_{i}^{3D}/\tau\right)}{\sum_{j=1}^{M}\exp\left(f_{i}^{2D}\cdot f_{j}^{3D}/\tau\right)}+\log\frac{\exp\left(f_{i}^{3D}\cdot f_{i}^{2D}/\tau\right)}{\sum_{j=1}^{M}\exp\left(f_{i}^{3D}\cdot f_{j}^{2D}/\tau\right)}\right]$$

wherein $f_{i}^{2D}$ is the i-th picture semantic feature, $f_{i}^{3D}$ is the i-th point cloud semantic feature, $f_{j}^{2D}$ is the j-th picture semantic feature, $f_{j}^{3D}$ is the j-th point cloud semantic feature, $M$ is the total number of features, $\mathcal{L}_{align}$ is the loss value for aligning the point cloud features with the picture features, and $\tau$ is a temperature hyperparameter.
Preferably, the classification loss function is:

$$\mathcal{L}_{cls}=\mathcal{L}_{align}+\mathcal{L}_{T3D}+\lambda\left(\mathcal{L}_{t}+\mathcal{L}_{2D}+\mathcal{L}_{3D}\right)$$

wherein $\mathcal{L}_{cls}$ is the classification loss value, $\mathcal{L}_{align}$ is the loss value for aligning the point cloud features with the picture features, $\mathcal{L}_{T3D}$ is the loss value between the point cloud features and the text features, $\mathcal{L}_{t}$, $\mathcal{L}_{2D}$ and $\mathcal{L}_{3D}$ are the classification losses of the query text, picture features and point cloud features respectively, and $\lambda$ is a preset loss ratio.
Preferably, the residual connection is calculated as:

$$f^{*}=\alpha\cdot g(f)+(1-\alpha)\cdot f$$

wherein $\alpha$ is the ratio of the residual connection, $f$ is the corresponding feature obtained by the text, two-dimensional or three-dimensional encoder, and $g(f)$ is the feature obtained by mapping $f$ through two fully connected layers.
The embodiment of the invention also provides a three-dimensional visual positioning device based on weak supervision, which comprises:
the 3D query module is used for performing a 3D proposal box query for the input query text with a 3D object detector of the pre-trained 3D classification model, generating 3D proposal box features and their corresponding three-dimensional residual features;
the text query module is used for acquiring the query feature and category residual features of the input query text with the text classifier of the 3D classification model;
the multiplication module is used for performing matrix multiplication on the proposal box features and the category residual features of each proposal box to obtain category features;
the cosine module is used for calculating the cosine similarity between the three-dimensional residual features of different proposal boxes and the query feature;
and the positioning module is used for taking the proposal box with the highest cosine similarity score, together with its category features, as the query target.
Preferably, the apparatus further comprises a training module for:
acquiring a training data set containing object point clouds, object pictures and descriptive texts;
extracting three-dimensional candidate target boxes from the object point cloud with a pre-trained 3D object detector;
inputting the description text and the object point cloud in the target box into a preset network;
projecting the object point cloud in the target box onto a 2D plane according to the camera intrinsic and extrinsic parameters, and determining the picture region containing the object point cloud;
extracting features from the description text, the picture region and the object point cloud in the target box with the text encoder, image encoder and 3D encoder of a preset CLIP encoder, obtaining text features, picture features and point cloud features respectively;
performing similarity calculation among the text features, the picture features and the point cloud features, and determining the class probabilities of the picture and the point cloud;
adding an adapter to each of the text encoder, the image encoder and the 3D encoder, aligning the three-dimensional features with the text features using the picture features as a bridge, and establishing a semantic relation between the text features and the point cloud features with residual connections;
and training and optimizing through a preset classification loss function, completing the 3D classification model training, and obtaining the 3D object detector and the text classifier.
Further, the training module is specifically configured to:
performing contrastive learning on the point cloud features and the picture features, and establishing a matching relationship between the picture features and the point cloud features.
Further, the training module is specifically configured to:
implicitly aligning the point cloud features with the picture features according to the semantic alignment between the point cloud features and the picture features in the 3D encoder;
and optimizing the loss function of the 3D encoder, so that the point cloud features generated by the learned 3D encoder are aligned with the picture features.
Preferably, the loss function for aligning the point cloud features with the picture features is:

$$\mathcal{L}_{align}=-\frac{1}{2M}\sum_{i=1}^{M}\left[\log\frac{\exp\left(f_{i}^{2D}\cdot f_{i}^{3D}/\tau\right)}{\sum_{j=1}^{M}\exp\left(f_{i}^{2D}\cdot f_{j}^{3D}/\tau\right)}+\log\frac{\exp\left(f_{i}^{3D}\cdot f_{i}^{2D}/\tau\right)}{\sum_{j=1}^{M}\exp\left(f_{i}^{3D}\cdot f_{j}^{2D}/\tau\right)}\right]$$

wherein $f_{i}^{2D}$ is the i-th picture semantic feature, $f_{i}^{3D}$ is the i-th point cloud semantic feature, $f_{j}^{2D}$ is the j-th picture semantic feature, $f_{j}^{3D}$ is the j-th point cloud semantic feature, $M$ is the total number of features, $\mathcal{L}_{align}$ is the loss value for aligning the point cloud features with the picture features, and $\tau$ is a temperature hyperparameter.
Preferably, the classification loss function is:

$$\mathcal{L}_{cls}=\mathcal{L}_{align}+\mathcal{L}_{T3D}+\lambda\left(\mathcal{L}_{t}+\mathcal{L}_{2D}+\mathcal{L}_{3D}\right)$$

wherein $\mathcal{L}_{cls}$ is the classification loss value, $\mathcal{L}_{align}$ is the loss value for aligning the point cloud features with the picture features, $\mathcal{L}_{T3D}$ is the loss value between the point cloud features and the text features, $\mathcal{L}_{t}$, $\mathcal{L}_{2D}$ and $\mathcal{L}_{3D}$ are the classification losses of the query text, picture features and point cloud features respectively, and $\lambda$ is a preset loss ratio.
Preferably, the residual connection is calculated as:

$$f^{*}=\alpha\cdot g(f)+(1-\alpha)\cdot f$$

wherein $\alpha$ is the ratio of the residual connection, $f$ is the corresponding feature obtained by the text, two-dimensional or three-dimensional encoder, and $g(f)$ is the feature obtained by mapping $f$ through two fully connected layers.
The embodiment of the invention also provides a terminal device, which comprises a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, wherein the three-dimensional visual positioning method based on weak supervision is realized when the processor executes the computer program.
The embodiment of the invention also provides a computer readable storage medium, which comprises a stored computer program, wherein the computer program controls equipment where the computer readable storage medium is located to execute the three-dimensional visual positioning method based on weak supervision according to any one of the embodiments.
The invention provides a three-dimensional visual positioning method, device, equipment and medium based on weak supervision. A 3D object detector of a pre-trained 3D classification model performs a 3D proposal box query for the input query text, generating 3D proposal box features and their corresponding three-dimensional residual features; a text classifier of the 3D classification model produces the query feature and category residual features of the input query text; the proposal box features and category residual features of each proposal box are matrix-multiplied to obtain category features; the cosine similarity between the three-dimensional residual features of different proposal boxes and the query feature is calculated; and the proposal box with the highest cosine similarity score, together with its category features, is taken as the query target. The method and device reduce the three-dimensional bounding box labeling work, improve the accuracy of three-dimensional visual positioning, and promote its application.
Drawings
Fig. 1 is a schematic flow chart of a three-dimensional visual positioning method based on weak supervision according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a 3D classification model training process according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a three-dimensional visual positioning device based on weak supervision according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. The described embodiments are obviously only some, rather than all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
An embodiment of the invention provides a three-dimensional visual positioning method based on weak supervision; fig. 1 is a schematic flow chart of the method. The method comprises the following steps:
Step S1, performing a 3D proposal box query for the input query text with a 3D object detector of a pre-trained 3D classification model, generating 3D proposal box features and their corresponding three-dimensional residual features;
Step S2, acquiring the query feature and category residual features of the input query text with a text classifier of the 3D classification model;
Step S3, performing matrix multiplication on the proposal box features and the category residual features of each proposal box to obtain category features;
Step S4, calculating the cosine similarity between the three-dimensional residual features of different proposal boxes and the query feature;
and Step S5, taking the proposal box with the highest cosine similarity score, together with its category features, as the query target.
In a specific implementation, the input of three-dimensional visual positioning includes a three-dimensional point cloud scene and a text query Q. The point cloud scene contains N points, each represented by six dimensions (RGB color and XYZ coordinates). The 3D object proposal boxes of the scene are readily available: a pre-trained 3D object detector generates them, providing the initial candidate boxes for the subsequent three-dimensional visual positioning. Each dataset provides class labels for the 3D objects, and all of these class labels are also encoded to obtain their features, supporting a coarse-grained classification task that aids model learning.
According to the input query text, the 3D object detector performs the 3D proposal box query, generating the 3D proposal box features $f^{3D}$ and the corresponding three-dimensional residual features $r^{3D}$; the text module produces the text query feature $f^{T}$ and the category residual features $r^{T}$, from which the query classification result of the 3D-VG category is obtained.
Matrix multiplication of the proposal box features $f^{3D}$ with the category residual features $r^{T}$ yields a category prediction for each three-dimensional proposal box, making the category of the target proposal more consistent with the category of the query.
The cosine similarity between the three-dimensional residual features of the different proposal boxes and the query feature is then calculated.
The proposal boxes are ranked by the cosine similarity between their three-dimensional residual features $r^{3D}$ and the query feature $f^{T}$, and the proposal box with the highest similarity score, together with its category feature, is selected as the query target.
In this way, the method and device reduce the three-dimensional bounding box labeling work, improve the accuracy of three-dimensional visual positioning, and promote its application.
In yet another embodiment of the present invention, the 3D classification model training process specifically includes:
acquiring a training data set containing object point clouds, object pictures and descriptive texts;
extracting three-dimensional candidate target boxes from the object point cloud with a pre-trained 3D object detector;
inputting the description text and the object point cloud in the target box into a preset network;
projecting the object point cloud in the target box onto a 2D plane according to the camera intrinsic and extrinsic parameters, and determining the picture region containing the object point cloud;
extracting features from the description text, the picture region and the object point cloud in the target box with the text encoder, image encoder and 3D encoder of a preset CLIP encoder, obtaining text features, picture features and point cloud features respectively;
performing similarity calculation among the text features, the picture features and the point cloud features, and determining the class probabilities of the picture and the point cloud;
adding an adapter to each of the text encoder, the image encoder and the 3D encoder, implicitly aligning the three-dimensional features with the text features using the picture features as a bridge, and establishing a semantic relation between the text features and the point cloud features with residual connections;
and training and optimizing through a preset classification loss function, completing the 3D classification model training, and obtaining the 3D object detector and the text classifier.
Fig. 2 shows a schematic flow chart of the 3D classification model training process according to an embodiment of the present invention.
In the training phase, a training data set is acquired. Proposal boxes containing object point clouds are extracted from the point cloud scene, and the description text of the scene and the object point clouds in the proposal boxes are input into the network.
The object point cloud is projected onto a 2D plane through the camera intrinsic and extrinsic parameters, and the picture region containing the object point cloud is selected.
The text encoder, the image encoder and the 3D encoder of CLIP are used for extracting features of descriptive text, object pictures and object point clouds respectively.
Similarity calculation on the obtained text features, picture features and point cloud features yields the class probabilities of the picture and the point cloud, thereby enabling classification.
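A hedged sketch of this similarity-based classification follows: picture or point cloud features are compared against the encoded category labels, and a softmax over the scaled cosine similarities yields class probabilities. The shapes and the temperature value are assumptions:

```python
import torch

def class_probability(feats, cat_text_feats, tau=0.07):
    """feats: (B, D) picture or point cloud features;
    cat_text_feats: (C, D) text features of the category labels."""
    feats = feats / feats.norm(dim=-1, keepdim=True)
    cat_text_feats = cat_text_feats / cat_text_feats.norm(dim=-1, keepdim=True)
    logits = feats @ cat_text_feats.t() / tau   # cosine similarity / temperature
    return logits.softmax(dim=-1)               # (B, C) class probabilities
```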
One adapter is added to each of the text encoder, the image encoder and the 3D encoder; all adapters have the same structure (two identical fully connected layers with ReLU activation functions) and use a residual connection to keep the source semantics consistent with the adapted semantics.
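A sketch of one such adapter, following the description above (two identical fully connected layers with ReLU activations, blended with the source feature through a residual connection); the ratio name alpha and its default value are assumptions:

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Two identical fully connected layers with ReLU, plus a residual
    connection that keeps source semantics consistent with adapted ones."""

    def __init__(self, dim: int, alpha: float = 0.2):
        super().__init__()
        self.alpha = alpha
        self.g = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
        )

    def forward(self, f):
        # f* = alpha * g(f) + (1 - alpha) * f  (residual connection)
        return self.alpha * self.g(f) + (1 - self.alpha) * f
```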
The three-dimensional features and the text features are implicitly aligned by taking the picture features as a bridge, and residual connections establish the semantic relation between the text features and the point cloud features.
Training and optimization through a preset classification loss function completes the 3D classification model training, yielding the 3D object detector and the text classifier.
The 3D classification model consists of a text module, a 2D module and a 3D module. First, three-dimensional candidate target boxes are extracted from the point cloud scene and projected onto 2D image regions through the camera intrinsic-extrinsic transformation; the parameter-frozen text encoder and image encoder of the CLIP model then extract the features of the text query and of the 2D image regions respectively. The correspondence between a text query and a 2D image region can thus be measured by their CLIP features. This application uses contrastive learning to optimize the 3D encoder in the 3D module, so that the learned 3D features are similar to the text and 2D CLIP features.
For each 3D candidate proposal box, 1024 points are sampled inside the box, PointNet++ performs the initial feature coding, and a standard Transformer module then extracts the high-level 3D semantic features $F^{3D}=\{f_{i}^{3D}\}_{i=1}^{K}$, where $K$ is the total number of 3D candidate boxes; the above steps are realized by the 3D encoder $E_{3D}$.
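A structural sketch of this 3D encoder follows. PyTorch ships no PointNet++, so the backbone argument is a hypothetical stand-in for any PointNet++-style module mapping sampled points to token features; the pooling step and dimensions are likewise assumptions:

```python
import torch.nn as nn

class Encoder3D(nn.Module):
    def __init__(self, pointnet_backbone, dim=512, layers=4, heads=8):
        super().__init__()
        # PointNet++-style initial feature coding: (B, 1024, 6) -> (B, T, dim)
        self.backbone = pointnet_backbone
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, box_points):
        # box_points: (B, 1024, 6) -- 1024 RGB-XYZ points sampled per box
        tokens = self.backbone(box_points)   # initial feature coding
        tokens = self.transformer(tokens)    # standard Transformer module
        return tokens.mean(dim=1)            # (B, dim) high-level 3D feature
```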
Text encoder: the text encoder of the large-scale visual-language model CLIP is adopted as the text encoder $E_{T}$ (text encoders of other large models could also be used; this application uses the CLIP text encoder), and it encodes the query $Q$ into the query feature $f^{T}$. Meanwhile, each category label in the complete category list of the dataset is encoded by $E_{T}$ and represented as the category features $f^{C}$, where $C$ denotes the number of categories. During training, $E_{T}$ is frozen and directly loads the pre-trained CLIP model.
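A sketch of this frozen text branch using the OpenAI CLIP package as one concrete choice (the text notes other large-model text encoders would also work); the model variant, prompt template and example strings are assumptions:

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)   # load pre-trained CLIP
for p in model.parameters():                      # freeze E_T
    p.requires_grad_(False)

query = "the brown chair next to the window"      # example query Q
categories = ["chair", "table", "sofa", "bed"]    # dataset category list

with torch.no_grad():
    f_t = model.encode_text(clip.tokenize([query]).to(device))       # (1, D)
    f_c = model.encode_text(clip.tokenize(
        [f"a photo of a {c}" for c in categories]).to(device))       # (C, D)
```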
2D encoder: for each 3D candidate proposal box, the point cloud in the box is projected into the original video through the camera intrinsic and extrinsic parameters, yielding a corresponding two-dimensional image region on each sampled frame. In practice, each 3D proposal box may have correspondences in multiple frames of the video and thus point to multiple 2D image regions; only the two-dimensional image region containing the most points projected from the point cloud is selected to pair with the three-dimensional candidate box. The image encoder of CLIP, $E_{2D}$, extracts the semantic features of these 2D image regions, denoted $f^{2D}$. Similarly, $E_{2D}$ is frozen and directly loads the pre-trained CLIP parameters.
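A hedged sketch of this 3D-to-2D pairing: the box's points are projected into each sampled frame with the camera parameters, and the frame whose image region contains the most projected points is kept. The matrix conventions (4x4 world-to-camera extrinsics, 3x3 intrinsics) are assumptions:

```python
import numpy as np

def project_points(pts_xyz, extrinsic, intrinsic, w, h):
    """pts_xyz: (N, 3) world coordinates; extrinsic: (4, 4) world-to-camera;
    intrinsic: (3, 3). Returns (M, 2) pixel coords of points visible in frame."""
    pts_h = np.concatenate([pts_xyz, np.ones((len(pts_xyz), 1))], axis=1)
    cam = (extrinsic @ pts_h.T).T[:, :3]
    cam = cam[cam[:, 2] > 0]                      # keep points in front of camera
    uv = (intrinsic @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv[inside]

def best_frame_region(pts_xyz, frames):
    """frames: list of dicts with 'extrinsic', 'intrinsic', 'w', 'h' per sampled
    frame. Returns the index of the frame with the most projected points and
    the 2D region bounding those points."""
    best_i, best_uv = -1, None
    for i, fr in enumerate(frames):
        uv = project_points(pts_xyz, fr["extrinsic"], fr["intrinsic"],
                            fr["w"], fr["h"])
        if best_uv is None or len(uv) > len(best_uv):
            best_i, best_uv = i, uv
    if best_uv is None or len(best_uv) == 0:
        return None                               # box not visible in any frame
    x0, y0 = best_uv.min(axis=0)
    x1, y1 = best_uv.max(axis=0)
    return best_i, (x0, y0, x1, y1)
```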
In this embodiment, the two-dimensional image serves as a bridge: the camera intrinsic-extrinsic transformation links the three-dimensional data to two dimensions, the large-scale visual-language model links two dimensions to text, and the semantic relation between the text and the three-dimensional point cloud is thereby implicitly established.
In yet another embodiment provided by the present invention, after the text features, the picture features and the point cloud features are obtained, the method further includes:
performing contrastive learning on the point cloud features and the picture features, and establishing a matching relationship between the picture features and the point cloud features.
In this embodiment, features are extracted from the description text, the object picture and the object point cloud respectively; once the text features, picture features and point cloud features are obtained, contrastive learning between the point cloud features and the picture features reduces the modality gap and establishes a one-to-one matching relationship between the picture features and the point cloud features.
In still another embodiment of the present invention, the contrastive learning on the point cloud features and the picture features to establish a matching relationship between them specifically includes:
implicitly aligning the point cloud features with the picture features according to the semantic alignment between the point cloud features and the picture features in the 3D encoder;
and optimizing the loss function of the 3D encoder, so that the point cloud features generated by the learned 3D encoder are aligned with the picture features.
In this embodiment, since large-scale visual-language models such as CLIP have already established high-level semantic alignment between 2D image features and text features, and the 2D correspondence of each 3D point cloud proposal box is easy to obtain, the two-dimensional features can naturally serve as a bridge, and contrastive learning implicitly aligns the three-dimensional features with the text features.
Specifically, following the classical contrastive loss, paired 3D proposal box features and 2D image region features are pulled closer, while unpaired ones are pushed apart.
In yet another embodiment of the present invention, the loss function for aligning the point cloud features with the picture features is:

$$\mathcal{L}_{align}=-\frac{1}{2M}\sum_{i=1}^{M}\left[\log\frac{\exp\left(f_{i}^{2D}\cdot f_{i}^{3D}/\tau\right)}{\sum_{j=1}^{M}\exp\left(f_{i}^{2D}\cdot f_{j}^{3D}/\tau\right)}+\log\frac{\exp\left(f_{i}^{3D}\cdot f_{i}^{2D}/\tau\right)}{\sum_{j=1}^{M}\exp\left(f_{i}^{3D}\cdot f_{j}^{2D}/\tau\right)}\right]$$

wherein $f_{i}^{2D}$ is the i-th picture semantic feature, $f_{i}^{3D}$ is the i-th point cloud semantic feature, $f_{j}^{2D}$ is the j-th picture semantic feature, $f_{j}^{3D}$ is the j-th point cloud semantic feature, $M$ is the total number of features, $\mathcal{L}_{align}$ is the loss value for aligning the point cloud features with the picture features, and $\tau$ is a temperature hyperparameter.
By optimizing the above loss function, the learned 3D proposal box features generated by the 3D encoder can be aligned with their 2D image features, so that they can be compared with the text features of the query.
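A sketch of this alignment loss as a symmetric InfoNCE objective, matching the reconstruction given above; the normalization step and default temperature are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def align_loss(f2d, f3d, tau=0.07):
    """f2d, f3d: (M, D) paired picture / point cloud semantic features."""
    f2d = F.normalize(f2d, dim=-1)
    f3d = F.normalize(f3d, dim=-1)
    logits = f2d @ f3d.t() / tau                  # (M, M) similarity matrix
    labels = torch.arange(len(f2d), device=f2d.device)
    # pull paired (diagonal) features together, push unpaired ones apart
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```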
In a further embodiment provided by the present invention, the classification loss function is:

$$\mathcal{L}_{cls}=\mathcal{L}_{align}+\mathcal{L}_{T3D}+\lambda\left(\mathcal{L}_{t}+\mathcal{L}_{2D}+\mathcal{L}_{3D}\right)$$

wherein $\mathcal{L}_{cls}$ is the classification loss value, $\mathcal{L}_{align}$ is the loss value for aligning the point cloud features with the picture features, $\mathcal{L}_{T3D}$ is the loss value between the point cloud features and the text features, $\mathcal{L}_{t}$, $\mathcal{L}_{2D}$ and $\mathcal{L}_{3D}$ are the classification losses of the query text, picture features and point cloud features respectively, and $\lambda$ is a preset loss ratio.
In this embodiment, the picture features $f^{2D}$ and the two-dimensional residual features $r^{2D}$ are matrix-multiplied to obtain the two-dimensional classification probability $p^{2D}$. A Softmax layer is applied on top of $p^{2D}$, and a two-dimensional classification cross-entropy loss $\mathcal{L}_{2D}$ is introduced to supervise the classification of the two-dimensional images. The three-dimensional classification loss $\mathcal{L}_{3D}$ is obtained in the same way.
The overall model is optimized through the classification loss function:

$$\mathcal{L}_{cls}=\mathcal{L}_{align}+\mathcal{L}_{T3D}+\lambda\left(\mathcal{L}_{t}+\mathcal{L}_{2D}+\mathcal{L}_{3D}\right)$$

wherein $\mathcal{L}_{cls}$ is the classification loss value, $\mathcal{L}_{align}$ is the loss value for aligning the point cloud features with the picture features, $\mathcal{L}_{T3D}$ is the loss value between the point cloud features and the text features, $\mathcal{L}_{t}$, $\mathcal{L}_{2D}$ and $\mathcal{L}_{3D}$ are the classification losses of the query text, picture features and point cloud features respectively, and $\lambda$ is a preset loss ratio.
By introducing these coarse-grained classification signals, which require no fine-grained box annotation, the learned adapted features acquire better semantic awareness of indoor-scene point clouds, thereby assisting the 3D-VG process.
In yet another embodiment of the present invention, the residual connection is calculated as:

$$f^{*}=\alpha\cdot g(f)+(1-\alpha)\cdot f$$

wherein $\alpha$ is the ratio of the residual connection, $f$ is the corresponding feature obtained by the text, two-dimensional or three-dimensional encoder, and $g(f)$ is the feature obtained by mapping $f$ through two fully connected layers.

In this embodiment, in order to introduce 3D-VG task-aware semantic knowledge into the whole model, three classification tasks based on the residual features, $\mathcal{L}_{t}$, $\mathcal{L}_{2D}$ and $\mathcal{L}_{3D}$, are introduced. First, a text classifier is added on the query feature $f^{T}$ to predict the distribution over the class labels of the 3D-VG dataset, supervised by the loss $\mathcal{L}_{t}$, which we denote the query classification loss. The residual connection is used to keep the source semantics consistent with the adapted semantics.
The method provided by the application uses contrastive learning to obtain 3D proposal box features that are essentially aligned with the 2D and text features of the large-scale visual-language model, and the multi-modal adaptation introduced by task-aware classification further guides the learned features to better support 3D visual grounding.
The application has been tested on two public and widely used datasets: ScanRefer and ReferIt3D.
The ScanRefer dataset originates from the indoor scene dataset ScanNet and is divided into two distinct parts, "Unique" and "Multiple", indicating whether a scene contains more than two distractors.
The ReferIt3D dataset is also built on the ScanNet dataset. It consists of two subsets, Sr3D and Nr3D, which adopt two different data splitting approaches. The partition into Easy and Hard is based on the number of distractors in the scene, while "View-dep" and "View-indep" are divided according to whether the language expression depends on the speaker's viewpoint.
The ReferIt3D dataset provides three-dimensional proposal boxes and category labels in the indoor point cloud scenes, so the provided boxes can be used directly as three-dimensional candidate proposal boxes and as coarse-grained supervisory signals for the model. The ScanRefer dataset, however, provides neither; therefore a pre-trained PointGroup is used as the detector to pre-extract proposal boxes and their class labels, and the pre-extracted information then aids model training.
On the ReferIt3D dataset, the performance of the 3D classification model of the proposed three-dimensional visual positioning method is compared with fully supervised models. While the 3D classification model does not outperform the supervised models on all subsets, its performance remains comparable. These results demonstrate the effectiveness and potential of the weakly supervised training approach provided by the application, which uses no 3D box annotation or explicit 3D-text correspondence supervision, and frees the three-dimensional visual positioning model from dependence on the collection and development of large-scale annotated datasets.
Referring to fig. 3, a schematic structural diagram of a three-dimensional visual positioning device based on weak supervision according to an embodiment of the present invention is provided, where the device includes:
the 3D query module is used for performing a 3D proposal box query for the input query text with a 3D object detector of the pre-trained 3D classification model, generating 3D proposal box features and their corresponding three-dimensional residual features;
the text query module is used for acquiring the query feature and category residual features of the input query text with the text classifier of the 3D classification model;
the multiplication module is used for performing matrix multiplication on the proposal box features and the category residual features of each proposal box to obtain category features;
the cosine module is used for calculating the cosine similarity between the three-dimensional residual features of different proposal boxes and the query feature;
and the positioning module is used for taking the proposal box with the highest cosine similarity score, together with its category features, as the query target.
The three-dimensional visual positioning device based on weak supervision provided in this embodiment can perform all the steps and functions of the three-dimensional visual positioning method based on weak supervision provided in any one of the above embodiments, and specific functions of the device are not described herein.
Referring to fig. 4, a schematic structural diagram of a terminal device according to an embodiment of the present invention is provided. The terminal device includes: a processor, a memory and a computer program stored in the memory and executable on the processor, such as a weakly supervised based three dimensional visual localization program. The steps in the embodiment of the three-dimensional visual positioning method based on weak supervision, such as steps S1 to S5 shown in fig. 1, are implemented when the processor executes the computer program. Alternatively, the processor may implement the functions of the modules in the above-described device embodiments when executing the computer program.
The computer program may be divided into one or more modules, which are stored in the memory and executed by the processor to accomplish the present invention, for example. The one or more modules may be a series of computer program instruction segments capable of performing a specific function for describing the execution of the computer program in the weakly supervised based three dimensional visual localization apparatus. For example, the computer program may be divided into various modules, and specific functions of each module are described in detail in the three-dimensional visual positioning method based on weak supervision provided in any one of the foregoing embodiments, and specific functions of the device are not described herein.
The three-dimensional visual positioning device based on weak supervision can be computing equipment such as a desktop computer, a notebook computer, a palm computer, a cloud server and the like. The three-dimensional visual positioning device based on weak supervision can comprise, but is not limited to, a processor and a memory. It will be appreciated by those skilled in the art that the schematic diagram is merely an example of a weakly-supervision based three-dimensional visual positioning apparatus, and is not meant to be limiting of a weakly-supervision based three-dimensional visual positioning apparatus, and may include more or fewer components than shown, or may combine certain components, or different components, e.g., the weakly-supervision based three-dimensional visual positioning apparatus may further include input and output devices, network access devices, buses, etc.
The processor may be a central processing unit (Central Processing Unit, CPU), other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. The general purpose processor may be a microprocessor or the processor may be any conventional processor or the like, which is a control center of the three-dimensional visual positioning device based on weak supervision, and various interfaces and lines are used to connect various parts of the whole three-dimensional visual positioning device based on weak supervision.
The memory may be used to store the computer program and/or the module, and the processor may implement the various functions of the three-dimensional visual positioning device based on weak supervision by running or executing the computer program and/or the module stored in the memory and invoking data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like; the storage data area may store data (such as audio data, phonebook, etc.) created according to the use of the handset, etc. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, memory, plug-in hard disk, smart Media Card (SMC), secure Digital (SD) Card, flash Card (Flash Card), at least one disk storage device, flash memory device, or other volatile solid-state storage device.
Wherein the module integrated on the basis of the weakly-supervised three-dimensional visual positioning device can be stored in a computer readable storage medium if the module is realized in the form of a software functional unit and sold or used as a separate product. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiment, or may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, the computer program may implement the steps of each of the method embodiments described above. Wherein the computer program comprises computer program code which may be in source code form, object code form, executable file or some intermediate form etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
It should be noted that modifications and adaptations to the invention may occur to one skilled in the art without departing from the principles of the present invention and are intended to be within the scope of the present invention.

Claims (10)

1. A weakly supervised three-dimensional visual positioning method, the method comprising:
performing a 3D proposal box query for the input query text with a 3D object detector of a pre-trained 3D classification model, generating 3D proposal box features and their corresponding three-dimensional residual features;
acquiring the query feature and category residual features of the input query text with a text classifier of the 3D classification model;
performing matrix multiplication on the proposal box features and the category residual features of each proposal box to obtain category features;
calculating the cosine similarity between the three-dimensional residual features of different proposal boxes and the query feature;
and taking the proposal box with the highest cosine similarity score, together with its category features, as the query target.
2. The weakly supervised three-dimensional visual positioning method of claim 1, wherein the 3D classification model training process specifically comprises:
acquiring a training data set containing object point clouds, object pictures and description texts;
extracting three-dimensional candidate target boxes from the object point cloud with a pre-trained 3D object detector;
inputting the description text and the object point cloud in the target box into a preset network;
projecting the object point cloud in the target box onto a 2D plane according to the camera intrinsic and extrinsic parameters, and determining the picture region containing the object point cloud;
extracting features from the description text, the picture region and the object point cloud in the target box with the text encoder, image encoder and 3D encoder of a preset CLIP encoder, obtaining text features, picture features and point cloud features respectively;
performing similarity calculation among the text features, the picture features and the point cloud features, and determining the class probabilities of the picture and the point cloud;
adding an adapter to each of the text encoder, the image encoder and the 3D encoder, aligning the three-dimensional features with the text features using the picture features as a bridge, and establishing a semantic relation between the text features and the point cloud features with residual connections;
and training and optimizing through a preset classification loss function, completing the 3D classification model training, and obtaining the 3D object detector and the text classifier.
3. The weakly supervised three-dimensional visual positioning method of claim 2, further comprising, after obtaining the text features, picture features and point cloud features respectively:
performing contrastive learning on the point cloud features and the picture features, and establishing a matching relationship between the picture features and the point cloud features.
4. The weakly supervised three-dimensional visual positioning method of claim 3, wherein the contrastive learning on the point cloud features and the picture features to establish a matching relationship between them specifically comprises:
implicitly aligning the point cloud features with the picture features according to the semantic alignment between the point cloud features and the picture features in the 3D encoder;
and optimizing the loss function of the 3D encoder, so that the point cloud features generated by the learned 3D encoder are aligned with the picture features.
5. The weakly supervised three-dimensional visual positioning method of claim 4, wherein the loss function for aligning the point cloud features with the picture features is:

$$\mathcal{L}_{align}=-\frac{1}{2M}\sum_{i=1}^{M}\left[\log\frac{\exp\left(f_{i}^{2D}\cdot f_{i}^{3D}/\tau\right)}{\sum_{j=1}^{M}\exp\left(f_{i}^{2D}\cdot f_{j}^{3D}/\tau\right)}+\log\frac{\exp\left(f_{i}^{3D}\cdot f_{i}^{2D}/\tau\right)}{\sum_{j=1}^{M}\exp\left(f_{i}^{3D}\cdot f_{j}^{2D}/\tau\right)}\right]$$

wherein $f_{i}^{2D}$ is the i-th picture semantic feature, $f_{i}^{3D}$ is the i-th point cloud semantic feature, $f_{j}^{2D}$ is the j-th picture semantic feature, $f_{j}^{3D}$ is the j-th point cloud semantic feature, $M$ is the total number of features, $\mathcal{L}_{align}$ is the loss value for aligning the point cloud features with the picture features, and $\tau$ is a temperature hyperparameter.
6. The weakly supervised three-dimensional visual positioning method of claim 2, wherein the classification loss function is:

$$\mathcal{L}_{cls}=\mathcal{L}_{align}+\mathcal{L}_{T3D}+\lambda\left(\mathcal{L}_{t}+\mathcal{L}_{2D}+\mathcal{L}_{3D}\right)$$

wherein $\mathcal{L}_{cls}$ is the classification loss value, $\mathcal{L}_{align}$ is the loss value for aligning the point cloud features with the picture features, $\mathcal{L}_{T3D}$ is the loss value between the point cloud features and the text features, $\mathcal{L}_{t}$, $\mathcal{L}_{2D}$ and $\mathcal{L}_{3D}$ are the classification losses of the query text, picture features and point cloud features respectively, and $\lambda$ is a preset loss ratio.
7. The weakly supervised three-dimensional visual positioning method of claim 2, wherein the residual connection is calculated as:

$$f^{*}=\alpha\cdot g(f)+(1-\alpha)\cdot f$$

wherein $\alpha$ is the ratio of the residual connection, $f$ is the corresponding feature obtained by the text, two-dimensional or three-dimensional encoder, and $g(f)$ is the feature obtained by mapping $f$ through two fully connected layers.
8. A weakly supervised three-dimensional visual positioning device, the device comprising:
the 3D query module, used for performing a 3D proposal box query for the input query text with a 3D object detector of a pre-trained 3D classification model, generating 3D proposal box features and their corresponding three-dimensional residual features;
the text query module, used for acquiring the query feature and category residual features of the input query text with the text classifier of the 3D classification model;
the multiplication module, used for performing matrix multiplication on the proposal box features and the category residual features of each proposal box to obtain category features;
the cosine module, used for calculating the cosine similarity between the three-dimensional residual features of different proposal boxes and the query feature;
and the positioning module, used for taking the proposal box with the highest cosine similarity score, together with its category features, as the query target.
9. A terminal device comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the weakly supervised three-dimensional visual positioning method of any one of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium, characterized in that the computer readable storage medium comprises a stored computer program, wherein the computer program, when run, controls a device in which the computer readable storage medium is located to perform the weakly supervised three-dimensional visual positioning method of any one of claims 1 to 7.
CN202410239096.0A 2024-03-04 2024-03-04 Three-dimensional visual positioning method, device, equipment and medium based on weak supervision Active CN117830601B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410239096.0A CN117830601B (en) 2024-03-04 2024-03-04 Three-dimensional visual positioning method, device, equipment and medium based on weak supervision

Publications (2)

Publication Number Publication Date
CN117830601A true CN117830601A (en) 2024-04-05
CN117830601B CN117830601B (en) 2024-05-24

Family

ID=90513743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410239096.0A Active CN117830601B (en) 2024-03-04 2024-03-04 Three-dimensional visual positioning method, device, equipment and medium based on weak supervision

Country Status (1)

Country Link
CN (1) CN117830601B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116363212A (en) * 2023-02-28 2023-06-30 浙江大学 3D visual positioning method and system based on semantic matching knowledge distillation
CN117274388A (en) * 2023-10-17 2023-12-22 四川大学 Unsupervised three-dimensional visual positioning method and system based on visual text relation alignment
CN117292237A (en) * 2023-09-25 2023-12-26 深圳大学 Joint reconstruction method, device and medium based on local transducer network
CN117496130A (en) * 2023-11-22 2024-02-02 中国科学院空天信息创新研究院 Basic model weak supervision target detection method based on context awareness self-training


Also Published As

Publication number Publication date
CN117830601B (en) 2024-05-24

Similar Documents

Publication Publication Date Title
US11610384B2 (en) Zero-shot object detection
CN111898696B (en) Pseudo tag and tag prediction model generation method, device, medium and equipment
Rahman et al. Notice of violation of IEEE publication principles: Recent advances in 3D object detection in the era of deep neural networks: A survey
WO2019169872A1 (en) Method and device for searching for content resource, and server
CN107766555B (en) Image retrieval method based on soft-constraint unsupervised cross-modal hashing
CN111680678B (en) Target area identification method, device, equipment and readable storage medium
CN112015923A (en) Multi-mode data retrieval method, system, terminal and storage medium
CN104915673A (en) Object classification method and system based on bag of visual word model
CN116431847B (en) Cross-modal hash retrieval method and device based on multiple contrast and double-way countermeasure
US11645478B2 (en) Multi-lingual tagging for digital images
Tan et al. Distinctive accuracy measurement of binary descriptors in mobile augmented reality
US20150154317A1 (en) Memory provided with set operation function, and method for processing set operation processing using same
CN116578738B (en) Graph-text retrieval method and device based on graph attention and generating countermeasure network
CN113468371A (en) Method, system, device, processor and computer readable storage medium for realizing natural sentence image retrieval
CN117830601B (en) Three-dimensional visual positioning method, device, equipment and medium based on weak supervision
Cao et al. Stable image matching for 3D reconstruction in outdoor
CN116958724A (en) Training method and related device for product classification model
CN113420545B (en) Abstract generation method, device, equipment and storage medium
CN113627186B (en) Entity relation detection method based on artificial intelligence and related equipment
CN115346095A (en) Visual question answering method, device, equipment and storage medium
CN114970467A (en) Composition initial draft generation method, device, equipment and medium based on artificial intelligence
Cao et al. Evaluation of local features for structure from motion
CN114139530A (en) Synonym extraction method and device, electronic equipment and storage medium
Jaiswal et al. US Traffic Sign Recognition by Using Partial OCR and Inbuilt Dictionary
CN111680722B (en) Content identification method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant