CN115169448A - Three-dimensional description generation and visual positioning unified method based on deep learning

Info

Publication number
CN115169448A
Authority
CN
China
Prior art keywords
module
description generation
visual positioning
target
objects
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210739467.2A
Other languages
Chinese (zh)
Inventor
盛律
徐东
赵立晨
蔡代刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN202210739467.2A
Publication of CN115169448A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a unified method for three-dimensional description generation and visual positioning based on deep learning, which uses a single joint framework to produce the combined outputs of three-dimensional description generation and point cloud visual positioning in a complex scene. The joint framework comprises a target detection module, a feature perception enhancement module, a description generation module and a visual positioning module. Through joint training of the three-dimensional point cloud visual positioning and description generation tasks, the model can fully learn the relational features between objects from the visual positioning task and the fine-grained features of individual objects from the description generation task. For any scene requiring the three-dimensional point cloud visual positioning and description generation tasks, the trained joint framework realizes both description generation and visual positioning of objects in the scene. The method can benefit the development of the AR/VR industry, and its combined application adapts to more practical scenarios, making everyday use more convenient.

Description

Three-dimensional description generation and visual positioning unified method based on deep learning
Technical Field
The invention relates to the field of point cloud description generation, visual positioning and deep learning, in particular to a three-dimensional description generation and visual positioning unified method based on deep learning.
Background
For indoor robot navigation, perceiving a complex indoor environment is an indispensable step. The robot must understand the scene and locate the required object, or the position to navigate to, within that environment. In AR/VR, generating a corresponding description for each object in a three-dimensional model space is an emerging technology with applications in VR games, AR furniture selection, virtual navigation and other scenarios. To further enhance the interaction between 3D scenes and natural language, researchers have proposed the visual positioning and description generation tasks.
The visual positioning task takes a text description and a scene as input and requires the model to locate, within the scene, the object described by the text; the description generation task takes a scene as input and generates a corresponding text description for any object in the scene. In the field of three-dimensional point cloud visual positioning and description generation, existing methods realize these two tasks in separate networks.
Current three-dimensional point cloud visual positioning and description generation methods mostly consist of two stages. In the first stage, object suggestions and candidate boxes are generated from the input scene using a three-dimensional object detector or a panoptic segmentation model. In the second stage, the visual positioning model uses the input text description to locate the object suggestion that matches it, while the description generation model generates a corresponding description sentence for each object suggestion. However, existing three-dimensional visual positioning techniques do not adequately model the relationships between different objects, and existing three-dimensional description generation techniques ignore the appearance information of the object itself.
Therefore, existing three-dimensional point cloud visual positioning and description generation methods have not been applied jointly and cannot adapt to more practical AR/VR application scenarios. In addition, previous methods use separate networks, can only address one task at a time, and their performance is limited. For example, the technology of patent publication CN113657478A strengthens the relationships between different objects through relational modeling, but that method depends strongly on the visual positioning problem itself and cannot be extended to the description generation task.
Disclosure of Invention
The invention aims to provide a unified method for three-dimensional description generation and visual positioning based on deep learning that overcomes the above shortcomings: a joint framework realizes both description generation and visual positioning of objects in a scene, and the two tasks assist each other while both are being solved.
In order to achieve this purpose, the invention adopts the following technical scheme:
The invention provides a unified method for three-dimensional description generation and visual positioning based on deep learning, which uses a joint framework to produce the combined outputs of three-dimensional description generation and point cloud visual positioning in a complex scene; the joint framework comprises: a target detection module, a feature perception enhancement module, a description generation module and a visual positioning module; the method comprises the following steps:
acquiring point cloud data and corresponding text data of a preset scene, and dividing the point cloud data and corresponding text data into a training set, a cross-validation set and a test set according to a preset proportion;
inputting the point cloud data in the training set into the target detection module, locating objects in the scene and generating initial object suggestions;
inputting the initial object suggestions into the feature perception enhancement module to generate corresponding enhanced target suggestions;
inputting the enhanced target suggestions into the description generation module and the visual positioning module respectively; the description generation module converts the enhanced target suggestions into text features and generates the description sentence corresponding to each object suggestion; the visual positioning module fuses the corresponding text data with the enhanced target suggestions to generate the position of the described object;
performing iterative training on the joint framework, and performing validation and testing with the cross-validation set and the test set;
and, for a scene requiring the three-dimensional point cloud visual positioning and description generation tasks, using the trained joint framework to realize description generation and visual positioning of objects in the scene.
Further, the target detection module adopts a VoteNet object detector, encodes the point cloud, predicts the distances from the center point to each side of the bounding box, locates objects in the scene and generates the initial object suggestions.
Further, the feature perception enhancement module consists of two stacked multi-head self-attention layers, wherein an additional attribute coding module and a relationship coding module are included;
the attribute coding module and the relation coding module are both composed of a plurality of full connection layers; the attribute coding module is used for coding the characteristics of the object, wherein the characteristics comprise: color, size and shape information; the relation coding module is used for coding the distance information between every two objects; the attribute codes and the relationship codes constitute enhanced target suggestions for the object.
Further, the description generation module uses one layer of multi-head cross-attention to fuse the enhanced target suggestions of the target object and of the other target objects, and then uses a fully-connected layer and a word prediction module to generate each word of the description sentence one by one.
Further, the description generation module selects, by a K-nearest-neighbour strategy, the K objects closest to the target object as the other target objects besides the target object.
Further, the visual positioning module consists of one layer of multi-head cross-attention, finally uses a classifier to generate a confidence score for each object suggestion, and takes the object with the highest predicted score as the final result.
Compared with the prior art, the invention has the following beneficial effects:
the method has the advantages that through the combined training of the three-dimensional point cloud visual positioning and the description generation task, the model can fully learn the relation characteristics among objects from the visual positioning task and can also learn the fine-grained characteristics of the objects from the description generation task; the method comprises the following steps of (1) adopting a trained combined frame to realize description generation and visual positioning of objects in a scene needing three-dimensional point cloud visual positioning and description generation tasks; the method can be beneficial to the development of AR/VR industry, and the combined application of the AR/VR industry and the model can adapt to more practical application scenes, so that the life of people is facilitated.
Drawings
FIG. 1 is a flow chart of the unified method for three-dimensional description generation and visual positioning based on deep learning;
FIG. 2 is a structural and flow diagram of the joint framework;
FIG. 3 is a structural and flow diagram of the description generation module and the visual positioning module.
Detailed Description
In order to make the technical means, creative features, objectives and effects of the invention easy to understand, the invention is further described below with reference to specific embodiments.
In the description of the present invention, it should be noted that terms such as "upper", "lower", "inner", "outer", "front", "rear", "both ends", "one end" and "the other end" indicate orientations or positional relationships based on those shown in the drawings and are used only for convenience and simplicity of description; they do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation, and therefore should not be construed as limiting the present invention. Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless otherwise explicitly specified or limited, terms such as "mounted", "disposed" and "connected" are to be construed broadly: for example, "connected" may mean fixedly connected, detachably connected or integrally connected; mechanically connected or electrically connected; directly connected or indirectly connected through an intermediate medium; or an internal connection between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art on a case-by-case basis.
The three-dimensional description generation task is more object-oriented and tends to learn the attribute information of the target object (i.e. the object of interest) in the scene, whereas the three-dimensional visual positioning task is more relation-oriented and focuses more on the relationships between objects. Based on this, the unified method for three-dimensional description generation and visual positioning based on deep learning provided by the invention uses a simple yet powerful network structure to jointly solve these two different but closely related tasks (three-dimensional description generation and visual positioning) within one unified framework. The joint training of the two tasks is realized with one task-agnostic three-dimensional object detector, namely the target detection module, an attribute-and-relation feature perception enhancement module, and two lightweight task-specific modules, namely the description generation module and the visual positioning module.
Referring to fig. 1, the present invention provides a unified method for three-dimensional description generation and visual positioning based on deep learning, comprising:
S10, acquiring point cloud data and corresponding text data of a preset scene, and dividing them into a training set, a cross-validation set and a test set according to a preset proportion;
S20, inputting the point cloud data in the training set into the target detection module, locating objects in the scene and generating initial object suggestions;
S30, inputting the initial object suggestions into the feature perception enhancement module to generate corresponding enhanced target suggestions;
S40, inputting the enhanced target suggestions into the description generation module and the visual positioning module respectively; the description generation module converts the enhanced target suggestions into text features and generates the description sentence corresponding to each object suggestion; the visual positioning module fuses the corresponding text data with the enhanced target suggestions to generate the position of the described object;
S50, performing iterative training on the joint framework, and performing validation and testing with the cross-validation set and the test set;
and S60, for a scene requiring the three-dimensional point cloud visual positioning and description generation tasks, using the trained joint framework to realize description generation and visual positioning of objects in the scene.
In step S10, the preset scene is, for example, an AR/VR scene, specifically in fields such as VR games, AR furniture selection and indoor virtual navigation; the method can also be applied to other fields based on 3D scenes. The point cloud data of the scene and the corresponding text data can be obtained with a laser scanner. Taking AR interior-decoration preview as an example, the point cloud data of the room and the text data related to indoor ornaments, furniture, home appliances and so on are acquired and divided into a training set, a cross-validation set and a test set, for example in a ratio of 7:2:1.
In step S20, the point cloud data in the training set are input into the target detection module, which detects the initial object suggestions: the target detection module uses the efficient existing VoteNet to cluster the point cloud, and then, following the idea of FCOS, generates the initial object suggestions by predicting the distances from the center point to each side of the target object's bounding box.
In step S30, the initial object suggestions are input into the attribute-and-relation feature perception enhancement module to obtain the enhanced target suggestions. The feature perception enhancement module consists of two stacked multi-head self-attention layers and contains an additional attribute encoding module and a relation encoding module, both composed of several fully-connected layers; the attribute encoding module encodes characteristics of the object itself (such as color, size and shape), and the relation encoding module encodes pairwise information such as the distance between every two objects.
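For reference, the following is a minimal PyTorch sketch of such an enhancement stage over M proposal features of dimension 128. The class name, the residual connections, the layer normalisation and the way the attribute and relation codes are injected are illustrative assumptions, not the exact network of the invention:

import torch
import torch.nn as nn

class FeatureEnhancement(nn.Module):
    """Two stacked multi-head self-attention layers over object proposal features.

    A minimal sketch: the attribute code is assumed to have been added to the
    proposal features beforehand, and the relation code enters as an additive
    attention bias (see the attribute/relation encoder sketches further below).
    """
    def __init__(self, dim=128, heads=4, num_layers=2):
        super().__init__()
        self.attn_layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(num_layers)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(num_layers)])

    def forward(self, feats, rel_bias=None):
        # feats: (B, M, dim) proposal features; rel_bias: (B * heads, M, M) additive bias or None
        for attn, norm in zip(self.attn_layers, self.norms):
            out, _ = attn(feats, feats, feats, attn_mask=rel_bias)
            feats = norm(feats + out)  # residual connection around each layer
        return feats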
In step S40, the description generation module converts the enhanced target suggestion features obtained from the feature perception enhancement module into text features and finally generates a description sentence corresponding to each object suggestion. The description generation module uses one layer of multi-head cross-attention to fuse the target object suggestion with the other, local object suggestions, and then uses a fully-connected layer and a word prediction module to generate each word of the description sentence one by one.
The visual positioning module fuses the corresponding input text data obtained in step S10 with the enhanced target suggestion features obtained from the feature perception enhancement module, and finds the position of the described object. The visual positioning module likewise consists of one layer of multi-head cross-attention; finally, a classifier generates a confidence score for each object suggestion, and the object with the highest predicted score is taken as the final result.
In steps S50-S60, the joint framework is trained iteratively, and validation and testing are performed with the cross-validation set and the test set; then, for a scene requiring the three-dimensional point cloud visual positioning and description generation tasks, the trained joint framework realizes description generation and visual positioning of objects in the scene.
The visual positioning technology enables the model to find, in a complex scene, the position of the object described by the language, assisting robot navigation and localization; the description generation task enables the model to generate a corresponding description for each object in the scene, helping the development of the AR/VR industry.
In this embodiment, a VoteNet target detection module together with an improved bounding-box modeling method is used to encode the point cloud, locate objects more accurately and generate the initial object suggestions. The suggestion features are then enhanced by the task-agnostic feature perception enhancement module to generate the enhanced target suggestions. Finally, the enhanced object suggestions are input into the description generation module and the visual positioning module of the dense description generation and visual positioning tasks respectively, and the final result of each task is generated.
Through joint training of the three-dimensional point cloud visual positioning and description generation tasks, the model can fully learn the relational features between objects from the visual positioning task and the fine-grained features of individual objects from the description generation task; for a scene requiring these tasks, the trained joint framework realizes description generation and visual positioning of objects in the scene.
The joint framework is described in detail below in conjunction with the figures:
as shown in fig. 2 (a), the combi-frame consists of three modules: 1) A target detection module; 2) A feature perception enhancement module for attributes and relationships; and 3) a task specific description generation module and a visual positioning module. The target detection module and the feature perception enhancement module are both task-independent modules, since they have no specific association with the subsequent task and can be shared by both tasks. The description generation module and the visual positioning module are task-specific Transformer-based lightweight network structures and are used for describing the generation and the visual positioning tasks respectively.
The 3D description generation task lets the model learn more complete fine-grained features, while the 3D visual positioning task lets the model learn more complete relational features. Through a joint training strategy the two tasks help each other: the description generation task helps the visual positioning task learn more of each object's own attributes (size, color, shape and other information), and the visual positioning task helps the description generation task learn the more complex relational information between objects.
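One common way to realise such a joint training strategy is to optimise a single weighted sum of the detection, description generation and visual positioning losses on every batch; the sketch below uses illustrative weights, which are assumptions rather than values taken from the invention:

def joint_training_loss(det_loss, caption_loss, grounding_loss,
                        w_det=1.0, w_cap=1.0, w_ground=1.0):
    """Single objective for joint training: a weighted sum of the three task losses."""
    return w_det * det_loss + w_cap * caption_loss + w_ground * grounding_loss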
As shown in fig. 2 (b), the attribute-and-relation feature perception enhancement module is composed of two stacked multi-head self-attention layers and contains an additional attribute encoding module and relation encoding module, both composed of several fully-connected layers; the attribute encoding module encodes characteristics of the object itself (such as color and size), and the relation encoding module encodes pairwise information such as the distance between every two objects.
As shown in fig. 2 (c), the description generation module uses one layer of multi-head cross-attention to fuse the target object suggestion with the other, local object suggestions. A fully-connected layer and a word prediction module are then used to generate each word of the description sentence one by one.
As shown in fig. 2 (d), the visual positioning module likewise consists of one layer of multi-head cross-attention; finally, a classifier generates a confidence score for each object suggestion, and the object with the highest predicted score is taken as the final result.
Specifically, as shown in fig. 3 (a) and (b), for the description generation module the Key and Value inputs of the cross-attention module are chosen with a K-nearest-neighbour strategy: according to the center distances in the three-dimensional coordinate space, the K objects closest to the target object are selected, filtering out objects in the scene with little relevance. The suggestion features of the selected objects are used as the Key and Value of the multi-head cross-attention module. In practice, K is set empirically, for example to 20. This strategy is designed specifically for the description generation task, because that task focuses mainly on the most salient relationships between the target object and its surrounding objects, and the remaining relational information is less important for it.
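The selection can be sketched as follows with torch.cdist and torch.topk; the function name and tensor layout are assumptions, and excluding the target from its own neighbour set follows the "other target objects" wording above:

import torch

def select_knn_proposals(centers, feats, target_idx, k=20):
    """Pick the k proposals whose box centres are closest to the target proposal.

    centers: (M, 3) proposal centres, feats: (M, C) proposal features,
    target_idx: index of the proposal being described.
    Returns (k, C) features used as Key/Value of the cross-attention module.
    """
    dist = torch.cdist(centers[target_idx : target_idx + 1], centers)  # (1, M) centre distances
    dist[0, target_idx] = float("inf")                                 # exclude the target itself
    _, knn_idx = torch.topk(dist, k, largest=False)                    # (1, k) nearest indices
    return feats[knn_idx.squeeze(0)]                                   # (k, C)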
The concepts of Query, Key and Value originate from recommendation systems. The basic principle is: given a Query, compute its correlation with each Key, and then select the most appropriate Value according to that correlation. For example, in an AR interior-decoration recommendation, the Query is a person's decoration preferences (decoration style, age, gender and so on), the Keys are the decoration types (European style, Chinese style and so on), and the Values are the decoration suggestions to be recommended. Although the attributes of Query, Key and Value live in different spaces, they have a latent relationship: through suitable transformations they can be mapped into a common space.
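In network terms, this Query/Key/Value interaction reduces to scaled dot-product attention; a minimal single-head sketch is given below (the modules of the invention use the multi-head variant):

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    """query: (Lq, d), key: (Lk, d), value: (Lk, dv).
    Each query scores its relevance to every key, and the scores weight the values."""
    scores = query @ key.transpose(-2, -1) / math.sqrt(query.size(-1))  # (Lq, Lk)
    weights = F.softmax(scores, dim=-1)                                 # normalised correlations
    return weights @ value                                              # (Lq, dv)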
As shown in fig. 3 (c), for the visual positioning module the Key and Value inputs of the cross-attention module are generated from the input textual language description. Specifically, a pretrained GloVe (Global Vectors for Word Representation) model and a Gated Recurrent Unit (GRU) model can be used to extract text features. The word features output by the GRU form the Key and Value. In addition, the GRU produces a global language feature used to predict the class of the object described in each sentence. The features of the object suggestions are used as the Query input. Finally, a basic classifier generates a confidence score for each object suggestion, and the object suggestion with the highest predicted score is taken as the final visual positioning result.
The unified method for three-dimensional description generation and visual positioning based on deep learning provided by the invention is further illustrated by a specific embodiment:
1. Taking various scenes (VR games, AR furniture selection and virtual navigation) as examples, the point clouds and the corresponding text data of these scenes can be divided into a training set, a cross-validation set and a test set in a ratio of 7:2:1.
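A minimal sketch of such a 7:2:1 split over scene indices follows; it is a purely random split with a fixed seed, since the invention does not prescribe how scenes are shuffled or grouped:

import torch

def split_scenes(num_scenes, ratios=(0.7, 0.2, 0.1), seed=0):
    """Randomly split scene indices into train / cross-validation / test index sets."""
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(num_scenes, generator=g)
    n_train = int(ratios[0] * num_scenes)
    n_val = int(ratios[1] * num_scenes)
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]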
2. The point cloud data P in the training set, with P ∈ R^(N×(3+K)), are input into the target detection module. Each of the N input points contains not only its 3-dimensional XYZ coordinates but also 1-dimensional object height information, 3-dimensional normal-vector information and 128-dimensional features obtained from 2D semantic segmentation, which together form K = 132 dimensions of auxiliary attribute features. Features are extracted from the point cloud with PointNet++, and the point cloud is then clustered with the voting and grouping modules of VoteNet to obtain the object suggestion centers. Then, following the idea of FCOS, the distances from the center point to each side of the object bounding box are predicted to obtain the initial object suggestion bounding boxes and the 128-dimensional initial object suggestion features.
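The N × (3 + 132)-dimensional detector input described above can be assembled as in the sketch below; the height, normal and 2D semantic features are assumed to be precomputed, and the concatenation order is an assumption:

import torch

def build_point_features(xyz, height, normals, sem2d_feat):
    """Assemble the N x (3 + K) detector input with K = 132.

    xyz: (N, 3) coordinates, height: (N, 1) object height,
    normals: (N, 3) normal vectors, sem2d_feat: (N, 128) 2D semantic features.
    """
    aux = torch.cat([height, normals, sem2d_feat], dim=1)  # (N, 132) auxiliary attribute features
    return torch.cat([xyz, aux], dim=1)                    # (N, 135) = N x (3 + 132)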
3. The initial object suggestions are input into the attribute-and-relation feature perception enhancement module to obtain the enhanced object suggestions. The detailed structure and working principle of the feature perception enhancement module are as follows:
In order to let the features of each object learn clearer characteristics of the object itself and to fully model the complex relationships between objects, the feature perception enhancement module is designed as a structure similar to a Transformer encoder. It mainly consists of two multi-head self-attention layers, which contain an attribute encoding module and a relation encoding module. The Query, Key and Value inputs of the self-attention layers are all the object suggestion features.
Attribute encoding module: in order to aggregate the attribute features with the initial object features, fully-connected layers splice the bounding-box-related features, i.e. the 27-dimensional bounding-box and center-point coordinate features (the 3-dimensional XYZ coordinates of the 8 bounding-box corners and of the 1 center point), with the previously input 132-dimensional auxiliary attribute features and encode them into a 128-dimensional attribute feature. This attribute feature is added to the 128-dimensional initial object suggestion feature to enhance it.
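A sketch of such an attribute encoder follows; the two-layer MLP, its hidden size, and the assumption that the 132-dimensional auxiliary attributes have already been pooled to one vector per proposal are illustrative choices, not the exact design:

import torch
import torch.nn as nn

class AttributeEncoder(nn.Module):
    """Encode box geometry + auxiliary attributes and add them to proposal features.

    box_geo: (M, 27) XYZ of 8 box corners and 1 centre; aux: (M, 132) auxiliary
    attributes (assumed pooled per proposal); feats: (M, 128) proposal features."""
    def __init__(self, dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(27 + 132, dim), nn.ReLU(inplace=True), nn.Linear(dim, dim)
        )

    def forward(self, feats, box_geo, aux):
        attr_code = self.mlp(torch.cat([box_geo, aux], dim=-1))  # (M, 128) attribute feature
        return feats + attr_code                                 # enhanced proposal features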
Relation encoding module: the relative distances between any two object suggestions are encoded to capture complex object relationships. Not only the Euclidean distance between the centers of any two object suggestions (i.e. Dist ∈ M × M × 1) but also the three per-axis distances between any two object suggestion centers along the x, y and z directions (i.e. [Dx, Dy, Dz] ∈ M × M × 3) are used, to better capture object relationships along different directions, where M is the number of initial object suggestions. These spatial proximity matrices (Dx, Dy, Dz and Dist) are then concatenated along the channel dimension and input into fully-connected layers to generate a spatial relation matrix whose output dimension H matches the number of attention heads of the multi-head self-attention module (in an implementation, for example, H = 4). The spatial relation matrix is then added to the similarity matrix (the so-called attention map) produced by each head of the multi-head self-attention module.
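A sketch of this relation encoding, producing one additive bias per attention head, is given below; signed centre offsets are used for Dx, Dy, Dz, the hidden size of the fully-connected layers is an assumption, and H = 4 follows the example in the text:

import torch
import torch.nn as nn

class RelationEncoder(nn.Module):
    """Map pairwise spatial relations between proposal centres to per-head attention biases."""
    def __init__(self, heads=4, hidden=64):
        super().__init__()
        self.heads = heads
        self.mlp = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(inplace=True),
                                 nn.Linear(hidden, heads))

    def forward(self, centers):
        # centers: (B, M, 3) proposal centres
        diff = centers.unsqueeze(2) - centers.unsqueeze(1)   # (B, M, M, 3): Dx, Dy, Dz
        dist = diff.norm(dim=-1, keepdim=True)               # (B, M, M, 1): Dist
        rel = self.mlp(torch.cat([diff, dist], dim=-1))      # (B, M, M, H) spatial relation matrix
        B, M = centers.shape[0], centers.shape[1]
        # reshape to the (B * H, M, M) additive attn_mask expected by nn.MultiheadAttention
        return rel.permute(0, 3, 1, 2).reshape(B * self.heads, M, M)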
4. The enhanced object suggestions are input into the description generation module to generate the corresponding descriptions. The detailed structure and working principle of the description generation module are as follows:
The description generation head mainly consists of a cross-attention layer. First, the object suggestions that need description sentences are selected; in the test phase these can be all the object suggestions in the scene (after non-maximum suppression (NMS)), taken one by one as input, and each word of the description is then generated step by step with a recurrent network structure. For each object, the hidden feature output by the multi-head cross-attention module and the word feature of the previous word (the ground-truth word in the training phase and the previously predicted word in the test phase; to avoid a large gap between training and testing, a partially autoregressive strategy is adopted during training, for example replacing 10% of the ground-truth words with predicted words) are fused with the current object suggestion feature and used as the Query input of the cross-attention module.
As shown in fig. 3, for the Key and Value inputs of the cross-attention module, a K-nearest-neighbour strategy is used: according to the center distances in the three-dimensional coordinate space, the K objects closest to the target object are selected, filtering out objects in the scene with little relevance. The suggestion features of the selected objects are used as the Key and Value of the multi-head cross-attention module.
Finally, the multi-head cross-attention module is followed by a fully-connected layer and a simple word prediction module that predict each word of the description one by one.
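One decoding step of this description generation head can be sketched as follows; the fusion by concatenation, the vocabulary size and the class name are assumptions, and prev_word is the ground-truth or previously predicted token discussed above:

import torch
import torch.nn as nn

class CaptionStep(nn.Module):
    """One word-prediction step of the description generation head (illustrative)."""
    def __init__(self, dim=128, heads=4, vocab_size=3000):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.fuse = nn.Linear(2 * dim, dim)              # fuse proposal feature + previous word
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.word_pred = nn.Linear(dim, vocab_size)      # simple word prediction module

    def forward(self, target_feat, prev_word, knn_feats):
        # target_feat: (B, dim), prev_word: (B,) token ids, knn_feats: (B, k, dim) K-NN proposals
        query = self.fuse(torch.cat([target_feat, self.word_emb(prev_word)], dim=-1))
        out, _ = self.cross_attn(query.unsqueeze(1), knn_feats, knn_feats)  # (B, 1, dim)
        return self.word_pred(out.squeeze(1))            # (B, vocab_size) logits for the next word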
5. The enhanced object suggestions and the corresponding text data input in step 1 are fed into the visual positioning module, which finds the object suggestion that the text describes. The detailed structure and working principle of the visual positioning module are as follows:
The 3D visual positioning task is to locate the object of interest according to the language description, so the visual positioning head mainly focuses on the matching between the given language description and the detected object suggestions.
As shown in fig. 3, a cross-attention module is used to locate the object suggestion described by the language. The Key and Value inputs of the cross-attention module are generated from the input textual language description. Specifically, a pretrained GloVe (Global Vectors for Word Representation) model and a Gated Recurrent Unit (GRU) model can be used to extract text features. The word features output by the GRU form the Key and Value. In addition, the GRU produces a global language feature used to predict the class of the object described in each sentence. The features of the object suggestions are used as the Query input. Finally, a basic classifier generates a confidence score for each object suggestion, and the object suggestion with the highest predicted score is taken as the final visual positioning result.
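A sketch of this visual positioning head follows; a learned embedding table stands in for the pretrained GloVe vectors, the global-feature object-class branch is omitted, and the layer sizes are assumptions:

import torch
import torch.nn as nn

class GroundingHead(nn.Module):
    """Match enhanced proposals against the input sentence (illustrative sketch)."""
    def __init__(self, dim=128, heads=4, vocab_size=3000):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)   # stand-in for pretrained GloVe vectors
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, 1)             # confidence score per proposal

    def forward(self, proposal_feats, tokens):
        # proposal_feats: (B, M, dim), tokens: (B, T) word ids of the description
        word_feats, _ = self.gru(self.word_emb(tokens))                   # (B, T, dim) -> Key/Value
        out, _ = self.cross_attn(proposal_feats, word_feats, word_feats)  # proposals as Query
        scores = self.classifier(out).squeeze(-1)                         # (B, M) confidence scores
        return scores, scores.argmax(dim=-1)                              # best proposal index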
The following shows the effect of the method provided by the invention (Ours) on several datasets (ScanRefer, Scan2Cap); it achieves the best performance compared with other methods:
TABLE 1 Visual localization results on the ScanRefer dataset
[Table presented as an image in the original document.]
TABLE 2 Description generation results on the Scan2Cap dataset
[Table presented as an image in the original document.]
With the method provided by the invention, joint training of the three-dimensional point cloud visual positioning and description generation tasks lets the model fully learn the relational features between objects from the visual positioning task and the fine-grained features of individual objects from the description generation task. For a scene requiring these tasks, the trained joint framework realizes description generation and visual positioning of objects in the scene. The method can benefit the development of the AR/VR industry, and its combined application adapts to more practical scenarios, making everyday use more convenient.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (6)

1. A unified method for three-dimensional description generation and visual positioning based on deep learning, characterized in that a joint framework is used to produce the combined outputs of three-dimensional description generation and point cloud visual positioning in a complex scene; the joint framework comprises: a target detection module, a feature perception enhancement module, a description generation module and a visual positioning module; the method comprises the following steps:
acquiring point cloud data and corresponding text data of a preset scene, and dividing the point cloud data and corresponding text data into a training set, a cross-validation set and a test set according to a preset proportion;
inputting the point cloud data in the training set into the target detection module, locating objects in the scene and generating initial object suggestions;
inputting the initial object suggestions into the feature perception enhancement module to generate corresponding enhanced target suggestions;
inputting the enhanced target suggestions into the description generation module and the visual positioning module respectively; the description generation module converts the enhanced target suggestions into text features and generates the description sentence corresponding to each object suggestion; the visual positioning module fuses the corresponding text data with the enhanced target suggestions to generate the position of the described object;
performing iterative training on the joint framework, and performing validation and testing with the cross-validation set and the test set;
and, for a scene requiring the three-dimensional point cloud visual positioning and description generation tasks, using the trained joint framework to realize description generation and visual positioning of objects in the scene.
2. The method of claim 1, wherein the target detection module adopts a VoteNet object detector, encodes the point cloud, predicts the distances from the center point to each side of the bounding box, locates objects in the scene and generates the initial object suggestions.
3. The unified method for three-dimensional description generation and visual positioning based on deep learning of claim 1, wherein the feature perception enhancement module is composed of two stacked multi-head self-attention layers, which contain an additional attribute encoding module and relation encoding module;
the attribute encoding module and the relation encoding module are both composed of several fully-connected layers; the attribute encoding module is used for encoding characteristics of the object itself, the characteristics comprising color, size and shape information; the relation encoding module is used for encoding the distance information between every two objects; the attribute codes and the relation codes constitute the enhanced target suggestions of the objects.
4. The unified method for three-dimensional description generation and visual positioning based on deep learning of claim 1, wherein the description generation module uses one layer of multi-head cross-attention to fuse the enhanced target suggestions of the target object and of the other target objects, and then uses a fully-connected layer and a word prediction module to generate each word of the description sentence one by one.
5. The method of claim 4, wherein the description generation module selects, by a K-nearest-neighbour strategy, the K objects closest to the target object as the other target objects besides the target object.
6. The unified method for three-dimensional description generation and visual positioning based on deep learning of claim 1, wherein the visual positioning module consists of one layer of multi-head cross-attention, finally uses a classifier to generate a confidence score for each object suggestion, and takes the object with the highest predicted score as the final result.
CN202210739467.2A 2022-05-31 2022-05-31 Three-dimensional description generation and visual positioning unified method based on deep learning Pending CN115169448A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210739467.2A CN115169448A (en) 2022-05-31 2022-05-31 Three-dimensional description generation and visual positioning unified method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210739467.2A CN115169448A (en) 2022-05-31 2022-05-31 Three-dimensional description generation and visual positioning unified method based on deep learning

Publications (1)

Publication Number Publication Date
CN115169448A true CN115169448A (en) 2022-10-11

Family

ID=83488225

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210739467.2A Pending CN115169448A (en) 2022-05-31 2022-05-31 Three-dimensional description generation and visual positioning unified method based on deep learning

Country Status (1)

Country Link
CN (1) CN115169448A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116385757A (en) * 2022-12-30 2023-07-04 天津大学 Visual language navigation system and method based on VR equipment
CN116385757B (en) * 2022-12-30 2023-10-31 天津大学 Visual language navigation system and method based on VR equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination