CN115169448A - Three-dimensional description generation and visual positioning unified method based on deep learning

Info

Publication number
CN115169448A
Authority
CN
China
Prior art keywords
module
description generation
visual positioning
target
objects
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210739467.2A
Other languages
Chinese (zh)
Inventor
盛律
徐东
赵立晨
蔡代刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN202210739467.2A
Publication of CN115169448A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a unified method for three-dimensional description generation and visual positioning based on deep learning, which uses a single joint framework to produce the combined outputs of three-dimensional description generation and point cloud visual positioning in a complex scene. The joint framework comprises a target detection module, a feature perception enhancement module, a description generation module and a visual positioning module. Through joint training of the three-dimensional point cloud visual positioning and description generation tasks, the model can fully learn the relational features between objects from the visual positioning task and the fine-grained features of individual objects from the description generation task. For any scene requiring the three-dimensional point cloud visual positioning and description generation tasks, the trained joint framework realizes both description generation and visual positioning of objects in the scene. The method can benefit the development of the AR/VR industry, and its combined application adapts to more practical scenarios, making everyday use more convenient.

Description

Three-dimensional description generation and visual positioning unified method based on deep learning
Technical Field
The invention relates to the field of point cloud description generation, visual positioning and deep learning, in particular to a three-dimensional description generation and visual positioning unified method based on deep learning.
Background
For indoor robot navigation, perceiving a complex indoor environment is an indispensable step. The robot must understand the scene and locate the required object, or the position to navigate to, within that environment. In AR/VR, generating a corresponding description for each object in a three-dimensional model space is an emerging technology with applications in VR games, AR furniture selection, virtual navigation and other scenarios. To further enhance the interaction between 3D scenes and natural language, researchers have proposed the visual positioning and description generation tasks.
The visual positioning task takes a text description and a scene as input and requires the model to locate, within the scene, the object described by the text; the description generation task takes a scene as input and generates a corresponding text description for any object in the scene. In the field of three-dimensional point cloud visual positioning and description generation, existing methods realize these two tasks in separate networks.
Current three-dimensional point cloud visual positioning and description generation methods mostly consist of two stages. In the first stage, object suggestions and candidate boxes are generated from the input scene using a three-dimensional object detector or a panoptic segmentation model. In the second stage, the visual positioning model uses the input text description to locate the object suggestion that matches it, while the description generation model generates a corresponding description sentence for each object suggestion. However, existing three-dimensional visual positioning techniques do not adequately model the relationships between different objects, and existing three-dimensional description generation techniques ignore the appearance information of the object itself.
Therefore, existing three-dimensional point cloud visual positioning and description generation methods have not been applied jointly and cannot adapt to more practical AR/VR application scenarios. In addition, previous methods use separate networks, can only address one task at a time, and their performance is limited. For example, the technology of patent publication CN113657478A strengthens the relationships between different objects through relational modeling, but that method depends strongly on the visual positioning problem itself and cannot be extended to the description generation task.
Disclosure of Invention
The invention aims to provide a unified method for three-dimensional description generation and visual positioning based on deep learning that overcomes the above shortcomings: a joint framework realizes both description generation and visual positioning of objects in a scene, and the two tasks assist each other while both are being solved.
In order to achieve this purpose, the invention adopts the following technical scheme:
The invention provides a unified method for three-dimensional description generation and visual positioning based on deep learning, which uses a joint framework to produce the combined outputs of three-dimensional description generation and point cloud visual positioning in a complex scene; the joint framework comprises: a target detection module, a feature perception enhancement module, a description generation module and a visual positioning module; the method comprises the following steps:
acquiring point cloud data and corresponding text data of a preset scene, and dividing the point cloud data and corresponding text data into a training set, a cross-validation set and a test set according to a preset proportion;
inputting the point cloud data in the training set into the target detection module, locating objects in the scene and generating initial object suggestions;
inputting the initial object suggestions into the feature perception enhancement module to generate corresponding enhanced target suggestions;
inputting the enhanced target suggestions into the description generation module and the visual positioning module respectively; the description generation module converts the enhanced target suggestions into text features and generates the description sentence corresponding to each object suggestion; the visual positioning module fuses the corresponding text data with the enhanced target suggestions to generate the position of the described object;
performing iterative training on the joint framework, and performing validation and testing with the cross-validation set and the test set;
and, for a scene requiring the three-dimensional point cloud visual positioning and description generation tasks, using the trained joint framework to realize description generation and visual positioning of objects in the scene.
Further, the target detection module adopts a VoteNet object detector, encodes the point cloud, predicts the distances from the center point to each side of the bounding box, locates objects in the scene and generates the initial object suggestions.
Further, the feature perception enhancement module consists of two stacked multi-head self-attention layers, wherein an additional attribute coding module and a relationship coding module are included;
the attribute coding module and the relation coding module are both composed of a plurality of full connection layers; the attribute coding module is used for coding the characteristics of the object, wherein the characteristics comprise: color, size and shape information; the relation coding module is used for coding the distance information between every two objects; the attribute codes and the relationship codes constitute enhanced target suggestions for the object.
Further, the description generation module uses one layer of multi-head cross-attention to fuse the enhanced target suggestions of the target object and of the other target objects, and then uses a fully-connected layer and a word prediction module to generate each word of the description sentence one by one.
Further, the description generation module selects, by a K-nearest-neighbour strategy, the K objects closest to the target object as the other target objects besides the target object.
Further, the visual positioning module consists of one layer of multi-head cross-attention, finally uses a classifier to generate a confidence score for each object suggestion, and takes the object with the highest predicted score as the final result.
Compared with the prior art, the invention has the following beneficial effects:
the method has the advantages that through the combined training of the three-dimensional point cloud visual positioning and the description generation task, the model can fully learn the relation characteristics among objects from the visual positioning task and can also learn the fine-grained characteristics of the objects from the description generation task; the method comprises the following steps of (1) adopting a trained combined frame to realize description generation and visual positioning of objects in a scene needing three-dimensional point cloud visual positioning and description generation tasks; the method can be beneficial to the development of AR/VR industry, and the combined application of the AR/VR industry and the model can adapt to more practical application scenes, so that the life of people is facilitated.
Drawings
FIG. 1 is a flow chart of the unified method for three-dimensional description generation and visual positioning based on deep learning;
FIG. 2 is a structural and flow diagram of the joint framework;
FIG. 3 is a structural and flow diagram of the description generation module and the visual positioning module.
Detailed Description
In order to make the technical means, creative features, objectives and effects of the invention easy to understand, the invention is further described below with reference to specific embodiments.
In the description of the present invention, it should be noted that terms such as "upper", "lower", "inner", "outer", "front", "rear", "both ends", "one end" and "the other end" indicate orientations or positional relationships based on those shown in the drawings and are used only for convenience and simplicity of description; they do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation, and therefore should not be construed as limiting the present invention. Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless otherwise explicitly specified or limited, terms such as "mounted", "disposed" and "connected" are to be construed broadly: for example, "connected" may mean fixedly connected, detachably connected or integrally connected; mechanically connected or electrically connected; directly connected or indirectly connected through an intermediate medium; or an internal connection between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art on a case-by-case basis.
The three-dimensional description generation task is more object-oriented and tends to learn the attribute information of the target object (i.e. the object of interest) in the scene, whereas the three-dimensional visual positioning task is more relation-oriented and focuses more on the relationships between objects. Based on this, the unified method for three-dimensional description generation and visual positioning based on deep learning provided by the invention uses a simple yet powerful network structure to jointly solve these two different but closely related tasks (three-dimensional description generation and visual positioning) within one unified framework. The joint training of the two tasks is realized with one task-agnostic three-dimensional object detector, namely the target detection module, an attribute-and-relation feature perception enhancement module, and two lightweight task-specific modules, namely the description generation module and the visual positioning module.
Referring to fig. 1, the present invention provides a unified method for three-dimensional description generation and visual positioning based on deep learning, comprising:
S10, acquiring point cloud data and corresponding text data of a preset scene, and dividing them into a training set, a cross-validation set and a test set according to a preset proportion;
S20, inputting the point cloud data in the training set into the target detection module, locating objects in the scene and generating initial object suggestions;
S30, inputting the initial object suggestions into the feature perception enhancement module to generate corresponding enhanced target suggestions;
S40, inputting the enhanced target suggestions into the description generation module and the visual positioning module respectively; the description generation module converts the enhanced target suggestions into text features and generates the description sentence corresponding to each object suggestion; the visual positioning module fuses the corresponding text data with the enhanced target suggestions to generate the position of the described object;
S50, performing iterative training on the joint framework, and performing validation and testing with the cross-validation set and the test set;
and S60, for a scene requiring the three-dimensional point cloud visual positioning and description generation tasks, using the trained joint framework to realize description generation and visual positioning of objects in the scene.
In step S10, the preset scene is, for example, an AR/VR scene, specifically in fields such as VR games, AR furniture selection and indoor virtual navigation; the method can also be applied to other fields based on 3D scenes. The point cloud data of the scene and the corresponding text data can be obtained with a laser scanner. Taking AR interior-decoration preview as an example, the point cloud data of the room and the text data related to indoor ornaments, furniture, home appliances and so on are acquired and divided into a training set, a cross-validation set and a test set, for example in a ratio of 7:2:1.
In step S20, the point cloud data in the training set are input into the target detection module, which detects the initial object suggestions: the target detection module uses the efficient existing VoteNet to cluster the point cloud, and then, following the idea of FCOS, generates the initial object suggestions by predicting the distances from the center point to each side of the target object's bounding box.
In step S30, the initial object suggestions are input into the attribute-and-relation feature perception enhancement module to obtain the enhanced target suggestions. The feature perception enhancement module consists of two stacked multi-head self-attention layers and contains an additional attribute encoding module and a relation encoding module, both composed of several fully-connected layers; the attribute encoding module encodes characteristics of the object itself (such as color, size and shape), and the relation encoding module encodes pairwise information such as the distance between every two objects.
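For reference, the following is a minimal PyTorch sketch of such an enhancement stage over M proposal features of dimension 128. The class name, the residual connections, the layer normalisation and the way the attribute and relation codes are injected are illustrative assumptions, not the exact network of the invention:

import torch
import torch.nn as nn

class FeatureEnhancement(nn.Module):
    """Two stacked multi-head self-attention layers over object proposal features.

    A minimal sketch: the attribute code is assumed to have been added to the
    proposal features beforehand, and the relation code enters as an additive
    attention bias (see the attribute/relation encoder sketches further below).
    """
    def __init__(self, dim=128, heads=4, num_layers=2):
        super().__init__()
        self.attn_layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(num_layers)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(num_layers)])

    def forward(self, feats, rel_bias=None):
        # feats: (B, M, dim) proposal features; rel_bias: (B * heads, M, M) additive bias or None
        for attn, norm in zip(self.attn_layers, self.norms):
            out, _ = attn(feats, feats, feats, attn_mask=rel_bias)
            feats = norm(feats + out)  # residual connection around each layer
        return feats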
In step S40, the description generation module converts the enhanced target suggestion features obtained from the feature perception enhancement module into text features and finally generates a description sentence corresponding to each object suggestion. The description generation module uses one layer of multi-head cross-attention to fuse the target object suggestion with the other, local object suggestions, and then uses a fully-connected layer and a word prediction module to generate each word of the description sentence one by one.
The visual positioning module fuses the corresponding input text data obtained in step S10 with the enhanced target suggestion features obtained from the feature perception enhancement module, and finds the position of the described object. The visual positioning module likewise consists of one layer of multi-head cross-attention; finally, a classifier generates a confidence score for each object suggestion, and the object with the highest predicted score is taken as the final result.
In steps S50-S60, the joint framework is trained iteratively, and validation and testing are performed with the cross-validation set and the test set; then, for a scene requiring the three-dimensional point cloud visual positioning and description generation tasks, the trained joint framework realizes description generation and visual positioning of objects in the scene.
The visual positioning technology enables the model to find, in a complex scene, the position of the object described by the language, assisting robot navigation and localization; the description generation task enables the model to generate a corresponding description for each object in the scene, helping the development of the AR/VR industry.
In this embodiment, a VoteNet target detection module together with an improved bounding-box modeling method is used to encode the point cloud, locate objects more accurately and generate the initial object suggestions. The suggestion features are then enhanced by the task-agnostic feature perception enhancement module to generate the enhanced target suggestions. Finally, the enhanced object suggestions are input into the description generation module and the visual positioning module of the dense description generation and visual positioning tasks respectively, and the final result of each task is generated.
Through joint training of the three-dimensional point cloud visual positioning and description generation tasks, the model can fully learn the relational features between objects from the visual positioning task and the fine-grained features of individual objects from the description generation task; for a scene requiring these tasks, the trained joint framework realizes description generation and visual positioning of objects in the scene.
The joint framework is described in detail below in conjunction with the figures:
as shown in fig. 2 (a), the combi-frame consists of three modules: 1) A target detection module; 2) A feature perception enhancement module for attributes and relationships; and 3) a task specific description generation module and a visual positioning module. The target detection module and the feature perception enhancement module are both task-independent modules, since they have no specific association with the subsequent task and can be shared by both tasks. The description generation module and the visual positioning module are task-specific Transformer-based lightweight network structures and are used for describing the generation and the visual positioning tasks respectively.
The 3D description generation task lets the model learn more complete fine-grained features, while the 3D visual positioning task lets the model learn more complete relational features. Through a joint training strategy the two tasks help each other: the description generation task helps the visual positioning task learn more of each object's own attributes (size, color, shape and other information), and the visual positioning task helps the description generation task learn the more complex relational information between objects.
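One common way to realise such a joint training strategy is to optimise a single weighted sum of the detection, description generation and visual positioning losses on every batch; the sketch below uses illustrative weights, which are assumptions rather than values taken from the invention:

def joint_training_loss(det_loss, caption_loss, grounding_loss,
                        w_det=1.0, w_cap=1.0, w_ground=1.0):
    """Single objective for joint training: a weighted sum of the three task losses."""
    return w_det * det_loss + w_cap * caption_loss + w_ground * grounding_loss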
As shown in fig. 2 (b), the attribute-and-relation feature perception enhancement module is composed of two stacked multi-head self-attention layers and contains an additional attribute encoding module and relation encoding module, both composed of several fully-connected layers; the attribute encoding module encodes characteristics of the object itself (such as color and size), and the relation encoding module encodes pairwise information such as the distance between every two objects.
As shown in fig. 2 (c), the description generation module uses one layer of multi-head cross-attention to fuse the target object suggestion with the other, local object suggestions. A fully-connected layer and a word prediction module are then used to generate each word of the description sentence one by one.
As shown in fig. 2 (d), the visual positioning module likewise consists of one layer of multi-head cross-attention; finally, a classifier generates a confidence score for each object suggestion, and the object with the highest predicted score is taken as the final result.
Specifically, as shown in fig. 3 (a) and (b), for the description generation module the Key and Value inputs of the cross-attention module are chosen with a K-nearest-neighbour strategy: according to the center distances in the three-dimensional coordinate space, the K objects closest to the target object are selected, filtering out objects in the scene with little relevance. The suggestion features of the selected objects are used as the Key and Value of the multi-head cross-attention module. In practice, K is set empirically, for example to 20. This strategy is designed specifically for the description generation task, because that task focuses mainly on the most salient relationships between the target object and its surrounding objects, and the remaining relational information is less important for it.
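The selection can be sketched as follows with torch.cdist and torch.topk; the function name and tensor layout are assumptions, and excluding the target from its own neighbour set follows the "other target objects" wording above:

import torch

def select_knn_proposals(centers, feats, target_idx, k=20):
    """Pick the k proposals whose box centres are closest to the target proposal.

    centers: (M, 3) proposal centres, feats: (M, C) proposal features,
    target_idx: index of the proposal being described.
    Returns (k, C) features used as Key/Value of the cross-attention module.
    """
    dist = torch.cdist(centers[target_idx : target_idx + 1], centers)  # (1, M) centre distances
    dist[0, target_idx] = float("inf")                                 # exclude the target itself
    _, knn_idx = torch.topk(dist, k, largest=False)                    # (1, k) nearest indices
    return feats[knn_idx.squeeze(0)]                                   # (k, C)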
The concepts of Query, Key and Value originate from recommendation systems. The basic principle is: given a Query, compute its correlation with each Key, and then select the most appropriate Value according to that correlation. For example, in an AR interior-decoration recommendation, the Query is a person's decoration preferences (decoration style, age, gender and so on), the Keys are the decoration types (European style, Chinese style and so on), and the Values are the decoration suggestions to be recommended. Although the attributes of Query, Key and Value live in different spaces, they have a latent relationship: through suitable transformations they can be mapped into a common space.
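In network terms, this Query/Key/Value interaction reduces to scaled dot-product attention; a minimal single-head sketch is given below (the modules of the invention use the multi-head variant):

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    """query: (Lq, d), key: (Lk, d), value: (Lk, dv).
    Each query scores its relevance to every key, and the scores weight the values."""
    scores = query @ key.transpose(-2, -1) / math.sqrt(query.size(-1))  # (Lq, Lk)
    weights = F.softmax(scores, dim=-1)                                 # normalised correlations
    return weights @ value                                              # (Lq, dv)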
As shown in fig. 3 (c), for the visual positioning module the Key and Value inputs of the cross-attention module are generated from the input textual language description. Specifically, a pretrained GloVe (Global Vectors for Word Representation) model and a Gated Recurrent Unit (GRU) model can be used to extract text features. The word features output by the GRU form the Key and Value. In addition, the GRU produces a global language feature used to predict the class of the object described in each sentence. The features of the object suggestions are used as the Query input. Finally, a basic classifier generates a confidence score for each object suggestion, and the object suggestion with the highest predicted score is taken as the final visual positioning result.
The unified method for three-dimensional description generation and visual positioning based on deep learning provided by the invention is further illustrated by a specific embodiment:
1. Taking various scenes (VR games, AR furniture selection and virtual navigation) as examples, the point clouds and the corresponding text data of these scenes can be divided into a training set, a cross-validation set and a test set in a ratio of 7:2:1.
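A minimal sketch of such a 7:2:1 split over scene indices follows; it is a purely random split with a fixed seed, since the invention does not prescribe how scenes are shuffled or grouped:

import torch

def split_scenes(num_scenes, ratios=(0.7, 0.2, 0.1), seed=0):
    """Randomly split scene indices into train / cross-validation / test index sets."""
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(num_scenes, generator=g)
    n_train = int(ratios[0] * num_scenes)
    n_val = int(ratios[1] * num_scenes)
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]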
2. The point cloud data P in the training set, with P ∈ R^(N×(3+K)), are input into the target detection module. Each of the N input points contains not only its 3-dimensional XYZ coordinates but also 1-dimensional object height information, 3-dimensional normal-vector information and 128-dimensional features obtained from 2D semantic segmentation, which together form K = 132 dimensions of auxiliary attribute features. Features are extracted from the point cloud with PointNet++, and the point cloud is then clustered with the voting and grouping modules of VoteNet to obtain the object suggestion centers. Then, following the idea of FCOS, the distances from the center point to each side of the object bounding box are predicted to obtain the initial object suggestion bounding boxes and the 128-dimensional initial object suggestion features.
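The N × (3 + 132)-dimensional detector input described above can be assembled as in the sketch below; the height, normal and 2D semantic features are assumed to be precomputed, and the concatenation order is an assumption:

import torch

def build_point_features(xyz, height, normals, sem2d_feat):
    """Assemble the N x (3 + K) detector input with K = 132.

    xyz: (N, 3) coordinates, height: (N, 1) object height,
    normals: (N, 3) normal vectors, sem2d_feat: (N, 128) 2D semantic features.
    """
    aux = torch.cat([height, normals, sem2d_feat], dim=1)  # (N, 132) auxiliary attribute features
    return torch.cat([xyz, aux], dim=1)                    # (N, 135) = N x (3 + 132)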
3. The initial object suggestions are input into the attribute-and-relation feature perception enhancement module to obtain the enhanced object suggestions. The detailed structure and working principle of the feature perception enhancement module are as follows:
In order to let the features of each object learn clearer characteristics of the object itself and to fully model the complex relationships between objects, the feature perception enhancement module is designed as a structure similar to a Transformer encoder. It mainly consists of two multi-head self-attention layers, which contain an attribute encoding module and a relation encoding module. The Query, Key and Value inputs of the self-attention layers are all the object suggestion features.
Attribute encoding module: in order to aggregate the attribute features with the initial object features, fully-connected layers splice the bounding-box-related features, i.e. the 27-dimensional bounding-box and center-point coordinate features (the 3-dimensional XYZ coordinates of the 8 bounding-box corners and of the 1 center point), with the previously input 132-dimensional auxiliary attribute features and encode them into a 128-dimensional attribute feature. This attribute feature is added to the 128-dimensional initial object suggestion feature to enhance it.
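A sketch of such an attribute encoder follows; the two-layer MLP, its hidden size, and the assumption that the 132-dimensional auxiliary attributes have already been pooled to one vector per proposal are illustrative choices, not the exact design:

import torch
import torch.nn as nn

class AttributeEncoder(nn.Module):
    """Encode box geometry + auxiliary attributes and add them to proposal features.

    box_geo: (M, 27) XYZ of 8 box corners and 1 centre; aux: (M, 132) auxiliary
    attributes (assumed pooled per proposal); feats: (M, 128) proposal features."""
    def __init__(self, dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(27 + 132, dim), nn.ReLU(inplace=True), nn.Linear(dim, dim)
        )

    def forward(self, feats, box_geo, aux):
        attr_code = self.mlp(torch.cat([box_geo, aux], dim=-1))  # (M, 128) attribute feature
        return feats + attr_code                                 # enhanced proposal features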
Relation encoding module: the relative distances between any two object suggestions are encoded to capture complex object relationships. Not only the Euclidean distance between the centers of any two object suggestions (i.e. Dist ∈ M × M × 1) but also the three per-axis distances between any two object suggestion centers along the x, y and z directions (i.e. [Dx, Dy, Dz] ∈ M × M × 3) are used, to better capture object relationships along different directions, where M is the number of initial object suggestions. These spatial proximity matrices (Dx, Dy, Dz and Dist) are then concatenated along the channel dimension and input into fully-connected layers to generate a spatial relation matrix whose output dimension H matches the number of attention heads of the multi-head self-attention module (in an implementation, for example, H = 4). The spatial relation matrix is then added to the similarity matrix (the so-called attention map) produced by each head of the multi-head self-attention module.
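A sketch of this relation encoding, producing one additive bias per attention head, is given below; signed centre offsets are used for Dx, Dy, Dz, the hidden size of the fully-connected layers is an assumption, and H = 4 follows the example in the text:

import torch
import torch.nn as nn

class RelationEncoder(nn.Module):
    """Map pairwise spatial relations between proposal centres to per-head attention biases."""
    def __init__(self, heads=4, hidden=64):
        super().__init__()
        self.heads = heads
        self.mlp = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(inplace=True),
                                 nn.Linear(hidden, heads))

    def forward(self, centers):
        # centers: (B, M, 3) proposal centres
        diff = centers.unsqueeze(2) - centers.unsqueeze(1)   # (B, M, M, 3): Dx, Dy, Dz
        dist = diff.norm(dim=-1, keepdim=True)               # (B, M, M, 1): Dist
        rel = self.mlp(torch.cat([diff, dist], dim=-1))      # (B, M, M, H) spatial relation matrix
        B, M = centers.shape[0], centers.shape[1]
        # reshape to the (B * H, M, M) additive attn_mask expected by nn.MultiheadAttention
        return rel.permute(0, 3, 1, 2).reshape(B * self.heads, M, M)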
4. The enhanced object suggestions are input into the description generation module to generate the corresponding descriptions. The detailed structure and working principle of the description generation module are as follows:
The description generation head mainly consists of a cross-attention layer. First, the object suggestions that need description sentences are selected; in the test phase these can be all the object suggestions in the scene (after non-maximum suppression (NMS)), taken one by one as input, and each word of the description is then generated step by step with a recurrent network structure. For each object, the hidden feature output by the multi-head cross-attention module and the word feature of the previous word (the ground-truth word in the training phase and the previously predicted word in the test phase; to avoid a large gap between training and testing, a partially autoregressive strategy is adopted during training, for example replacing 10% of the ground-truth words with predicted words) are fused with the current object suggestion feature and used as the Query input of the cross-attention module.
As shown in fig. 3, for the Key and Value inputs of the cross-attention module, a K-nearest-neighbour strategy is used: according to the center distances in the three-dimensional coordinate space, the K objects closest to the target object are selected, filtering out objects in the scene with little relevance. The suggestion features of the selected objects are used as the Key and Value of the multi-head cross-attention module.
Finally, the multi-head cross-attention module is followed by a fully-connected layer and a simple word prediction module that predict each word of the description one by one.
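One decoding step of this description generation head can be sketched as follows; the fusion by concatenation, the vocabulary size and the class name are assumptions, and prev_word is the ground-truth or previously predicted token discussed above:

import torch
import torch.nn as nn

class CaptionStep(nn.Module):
    """One word-prediction step of the description generation head (illustrative)."""
    def __init__(self, dim=128, heads=4, vocab_size=3000):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.fuse = nn.Linear(2 * dim, dim)              # fuse proposal feature + previous word
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.word_pred = nn.Linear(dim, vocab_size)      # simple word prediction module

    def forward(self, target_feat, prev_word, knn_feats):
        # target_feat: (B, dim), prev_word: (B,) token ids, knn_feats: (B, k, dim) K-NN proposals
        query = self.fuse(torch.cat([target_feat, self.word_emb(prev_word)], dim=-1))
        out, _ = self.cross_attn(query.unsqueeze(1), knn_feats, knn_feats)  # (B, 1, dim)
        return self.word_pred(out.squeeze(1))            # (B, vocab_size) logits for the next word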
5. The enhanced object suggestions and the corresponding text data input in step 1 are fed into the visual positioning module, which finds the object suggestion that the text describes. The detailed structure and working principle of the visual positioning module are as follows:
The 3D visual positioning task is to locate the object of interest according to the language description, so the visual positioning head mainly focuses on the matching between the given language description and the detected object suggestions.
As shown in fig. 3, a cross-attention module is used to locate the object suggestion described by the language. The Key and Value inputs of the cross-attention module are generated from the input textual language description. Specifically, a pretrained GloVe (Global Vectors for Word Representation) model and a Gated Recurrent Unit (GRU) model can be used to extract text features. The word features output by the GRU form the Key and Value. In addition, the GRU produces a global language feature used to predict the class of the object described in each sentence. The features of the object suggestions are used as the Query input. Finally, a basic classifier generates a confidence score for each object suggestion, and the object suggestion with the highest predicted score is taken as the final visual positioning result.
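A sketch of this visual positioning head follows; a learned embedding table stands in for the pretrained GloVe vectors, the global-feature object-class branch is omitted, and the layer sizes are assumptions:

import torch
import torch.nn as nn

class GroundingHead(nn.Module):
    """Match enhanced proposals against the input sentence (illustrative sketch)."""
    def __init__(self, dim=128, heads=4, vocab_size=3000):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)   # stand-in for pretrained GloVe vectors
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, 1)             # confidence score per proposal

    def forward(self, proposal_feats, tokens):
        # proposal_feats: (B, M, dim), tokens: (B, T) word ids of the description
        word_feats, _ = self.gru(self.word_emb(tokens))                   # (B, T, dim) -> Key/Value
        out, _ = self.cross_attn(proposal_feats, word_feats, word_feats)  # proposals as Query
        scores = self.classifier(out).squeeze(-1)                         # (B, M) confidence scores
        return scores, scores.argmax(dim=-1)                              # best proposal index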
The following shows the effect of the method provided by the invention (Ours) on several datasets (ScanRefer, Scan2Cap); it achieves the best performance compared with other methods:
TABLE 1 Visual localization results on the ScanRefer dataset
[Table presented as an image in the original document.]
TABLE 2 Description generation results on the Scan2Cap dataset
[Table presented as an image in the original document.]
With the method provided by the invention, joint training of the three-dimensional point cloud visual positioning and description generation tasks lets the model fully learn the relational features between objects from the visual positioning task and the fine-grained features of individual objects from the description generation task. For a scene requiring these tasks, the trained joint framework realizes description generation and visual positioning of objects in the scene. The method can benefit the development of the AR/VR industry, and its combined application adapts to more practical scenarios, making everyday use more convenient.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (6)

1. A unified method for three-dimensional description generation and visual positioning based on deep learning, characterized in that a joint framework is used to produce the combined outputs of three-dimensional description generation and point cloud visual positioning in a complex scene; the joint framework comprises: a target detection module, a feature perception enhancement module, a description generation module and a visual positioning module; the method comprises the following steps:
acquiring point cloud data and corresponding text data of a preset scene, and dividing the point cloud data and corresponding text data into a training set, a cross-validation set and a test set according to a preset proportion;
inputting the point cloud data in the training set into the target detection module, locating objects in the scene and generating initial object suggestions;
inputting the initial object suggestions into the feature perception enhancement module to generate corresponding enhanced target suggestions;
inputting the enhanced target suggestions into the description generation module and the visual positioning module respectively; the description generation module converts the enhanced target suggestions into text features and generates the description sentence corresponding to each object suggestion; the visual positioning module fuses the corresponding text data with the enhanced target suggestions to generate the position of the described object;
performing iterative training on the joint framework, and performing validation and testing with the cross-validation set and the test set;
and, for a scene requiring the three-dimensional point cloud visual positioning and description generation tasks, using the trained joint framework to realize description generation and visual positioning of objects in the scene.
2. The method of claim 1, wherein the target detection module adopts a VoteNet object detector, encodes the point cloud, predicts the distances from the center point to each side of the bounding box, locates objects in the scene and generates the initial object suggestions.
3. The unified method for three-dimensional description generation and visual positioning based on deep learning of claim 1, wherein the feature perception enhancement module is composed of two stacked multi-head self-attention layers, which contain an additional attribute encoding module and relation encoding module;
the attribute encoding module and the relation encoding module are both composed of several fully-connected layers; the attribute encoding module is used for encoding characteristics of the object itself, the characteristics comprising color, size and shape information; the relation encoding module is used for encoding the distance information between every two objects; the attribute codes and the relation codes constitute the enhanced target suggestions of the objects.
4. The unified method for three-dimensional description generation and visual positioning based on deep learning of claim 1, wherein the description generation module uses one layer of multi-head cross-attention to fuse the enhanced target suggestions of the target object and of the other target objects, and then uses a fully-connected layer and a word prediction module to generate each word of the description sentence one by one.
5. The method of claim 4, wherein the description generation module selects, by a K-nearest-neighbour strategy, the K objects closest to the target object as the other target objects besides the target object.
6. The unified method for three-dimensional description generation and visual positioning based on deep learning of claim 1, wherein the visual positioning module consists of one layer of multi-head cross-attention, finally uses a classifier to generate a confidence score for each object suggestion, and takes the object with the highest predicted score as the final result.
CN202210739467.2A 2022-05-31 2022-05-31 Three-dimensional description generation and visual positioning unified method based on deep learning Pending CN115169448A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210739467.2A CN115169448A (en) 2022-05-31 2022-05-31 Three-dimensional description generation and visual positioning unified method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210739467.2A CN115169448A (en) 2022-05-31 2022-05-31 Three-dimensional description generation and visual positioning unified method based on deep learning

Publications (1)

Publication Number Publication Date
CN115169448A true CN115169448A (en) 2022-10-11

Family

ID=83488225

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210739467.2A Pending CN115169448A (en) 2022-05-31 2022-05-31 Three-dimensional description generation and visual positioning unified method based on deep learning

Country Status (1)

Country Link
CN (1) CN115169448A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116385757A (en) * 2022-12-30 2023-07-04 天津大学 Visual language navigation system and method based on VR equipment
CN116385757B (en) * 2022-12-30 2023-10-31 天津大学 Visual language navigation system and method based on VR equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination