WO2020227933A1 - Six-degree-of-freedom attitude estimation method and device, and computer-readable storage medium - Google Patents

Six-degree-of-freedom attitude estimation method and device, and computer-readable storage medium

Info

Publication number
WO2020227933A1
WO2020227933A1 (PCT/CN2019/086883)
Authority
WO
WIPO (PCT)
Prior art keywords
network
estimation
target
target object
candidate
Prior art date
Application number
PCT/CN2019/086883
Other languages
English (en)
French (fr)
Inventor
邹文斌
卓圣楷
庄兆永
吴迪
李霞
徐晨
Original Assignee
深圳大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳大学
Priority to PCT/CN2019/086883
Publication of WO2020227933A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis

Definitions

  • the present invention relates to the technical field of spatial positioning, in particular to a six-degree-of-freedom attitude estimation method, device and computer-readable storage medium.
  • the perception of the surrounding environment of the vehicle is the core technology in the autonomous driving system.
  • the perception of the vehicle's surroundings covers object detection and semantic segmentation in images of the surrounding environment, such as sidewalk detection, roadway detection, lane line detection, vehicle detection, and pedestrian detection.
  • vehicle multi-degree-of-freedom pose estimation is an extension of traditional object detection and semantic segmentation into three-dimensional space. Its main task is to accurately locate and identify all vehicle objects in a vehicle driving video sequence or single-frame image, and at the same time estimate the multi-degree-of-freedom pose of each detected vehicle in three-dimensional space.
  • at present, a multi-stage vehicle six-degree-of-freedom attitude estimation network combining deep learning and geometric constraint methods is usually used. This method realizes six-degree-of-freedom attitude estimation of a car in two steps: a deep neural network first detects the vehicles in an input monocular RGB image and estimates each detected vehicle's length, width, height and three-degree-of-freedom direction, and a geometric constraint relationship is then used to calculate the vehicle's three-degree-of-freedom position in the three-dimensional space of the actual driving scene.
  • although the above deep-learning-based multi-degree-of-freedom pose estimation method can perceive the surroundings of a target control object and has achieved good results in related scenarios, the model still suffers from a cumbersome training and testing process, the inability to train and test end to end, and slow attitude estimation. These defects restrict the application of automation technology in scenarios with high control-accuracy and real-time requirements, and thus impose considerable limitations in practical applications.
  • the main purpose of the embodiments of the present invention is to provide a six-degree-of-freedom attitude estimation method, device, and computer-readable storage medium, which can at least solve the following problems that arise in the related art when a combination of deep learning and geometric constraints is used to perceive the surroundings of a target control object: the training and testing process of the model is cumbersome, end-to-end training and testing cannot be realized, and the pose estimation of objects in the surrounding environment is slow.
  • the first aspect of the embodiments of the present invention provides a six-degree-of-freedom attitude estimation method, which is applied to an overall convolutional neural network including a target detection main network, a first estimation branch network, and a second estimation branch network.
  • the method includes:
  • Input the target image to the target detection main network, control the target detection main network to perform feature extraction on the target image to obtain a feature map, and then detect, based on the feature map, the category of each candidate object in the target image and the two-dimensional bounding box information of each candidate object in the pixel coordinate system corresponding to the target image;
  • Obtain the feature map corresponding to a target object of a preset category among all the candidate objects, input it to the first estimation branch network, and control the first estimation branch network to estimate the three-dimensional direction of the target object in the camera coordinate system;
  • Control the second estimation branch network to estimate the three-dimensional position of the target object in the camera coordinate system based on the two-dimensional bounding box information and feature map corresponding to the target object, and then obtain the six-degree-of-freedom attitude information of the target object from the three-dimensional position and the three-dimensional direction.
  • the second aspect of the embodiments of the present invention provides a six-degree-of-freedom attitude estimation device, which is applied to an overall convolutional neural network including a target detection main network, a first estimation branch network, and a second estimation branch network.
  • the device includes:
  • the detection module is used to input the target image to the target detection main network, control the target detection main network to perform feature extraction on the target image to obtain a feature map, and then detect, based on the feature map, the category of each candidate object in the target image and the two-dimensional bounding box information of each candidate object in the pixel coordinate system corresponding to the target image;
  • the first estimation module is used to obtain the feature map corresponding to the target object of the preset category among all the candidate objects, input it to the first estimation branch network, and control the first estimation branch network to estimate the three-dimensional direction of the target object in the camera coordinate system;
  • the second estimation module is used to control the second estimation branch network to estimate the three-dimensional position of the target object in the camera coordinate system based on the two-dimensional bounding box information and feature map corresponding to the target object, and then obtain the six-degree-of-freedom attitude information of the target object from the three-dimensional position and the three-dimensional direction.
  • a third aspect of the embodiments of the present invention provides an electronic device, which includes: a processor, a memory, and a communication bus;
  • the communication bus is used to implement connection and communication between the processor and the memory;
  • the processor is configured to execute one or more programs stored in the memory to implement the steps of any one of the six-degree-of-freedom attitude estimation methods described above.
  • a fourth aspect of the embodiments of the present invention provides a computer-readable storage medium, the computer-readable storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the steps of any of the six-degree-of-freedom attitude estimation methods described above.
  • the target detection main network is controlled to extract features of the input target image, and then to detect and output the category of each candidate object in the target image and the two-dimensional bounding box information of each candidate object; the feature map corresponding to the preset-category target object among all the candidate objects is obtained and input into the first estimation branch network, which is controlled to estimate the three-dimensional direction of the target object in the camera coordinate system; the second estimation branch network is controlled to estimate the three-dimensional position of the target object in the camera coordinate system based on the two-dimensional bounding box information and feature map of the target object, and the three-dimensional position and three-dimensional direction are then used to obtain the six-degree-of-freedom attitude information of the target object.
  • different network branches respectively estimate the three-dimensional direction and three-dimensional position of the target object, realizing end-to-end six-degree-of-freedom attitude estimation of objects in the environment surrounding the target object and effectively improving both the speed and the accuracy of the computation.
  • FIG. 1 is a schematic diagram of the basic flow of the six-degree-of-freedom attitude estimation method provided by the first embodiment of the present invention
  • FIG. 2 is a schematic diagram of the overall network framework provided by the first embodiment of the present invention.
  • FIG. 3 is a schematic flowchart of the target detection method provided by the first embodiment of the present invention.
  • FIG. 4 is a schematic diagram of multi-scale feature extraction provided by the first embodiment of the present invention.
  • FIG. 5 is a schematic diagram of candidate region extraction provided by the first embodiment of the present invention.
  • FIG. 6 is a schematic diagram of the pooling of feature maps of candidate regions provided by the first embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of a six-degree-of-freedom attitude estimation device provided by a second embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of an electronic device provided by a third embodiment of the present invention.
  • this embodiment proposes a six-degree-of-freedom attitude estimation method, which is applied to an overall convolutional neural network including a target detection main network, a first estimation branch network, and a second estimation branch network, as shown in Figure 1
  • the six-degree-of-freedom attitude estimation method proposed in this embodiment includes the following steps:
  • Step 101 Input the target image to the target detection main network, control the target detection main network to perform feature extraction on the target image to obtain a feature map, and then detect, based on the feature map, the category of each candidate object in the target image and the two-dimensional bounding box information of each candidate object in the pixel coordinate system corresponding to the target image.
  • the target detection main network of this embodiment extracts features of the input image, and then detects and outputs the category of the object in the image and the two-dimensional bounding box of the object.
  • the target image in this embodiment can be a monocular RGB image collected by a monocular camera.
  • a candidate object is an object of interest, and its type can be selected according to the specific application scenario. In a driving application scenario, candidate objects may include pedestrians, vehicles, and so on.
  • FIG. 2 is a schematic diagram of the overall network framework provided in this embodiment.
  • the box identified by A in Figure 2 indicates a target detection main network provided in this embodiment.
  • optionally, the target detection main network includes a multi-scale feature extraction network, a candidate region extraction network, a candidate region feature map pooling layer, and an object classification and bounding box regression fully connected layer.
  • this embodiment provides a target detection method.
  • FIG. 3 is a schematic flowchart of the target detection method provided in this embodiment, which includes the following steps:
  • Step 301 Perform multi-scale feature extraction on the target image using a multi-scale feature extraction network to obtain feature maps of different scales;
  • Step 302 Use the candidate region extraction network to extract feature maps corresponding to preset candidate regions from feature maps of different scales;
  • Step 303 Use the candidate region feature map pooling layer to perform a pooling operation on all candidate region feature maps, and unify the size of all candidate region feature maps;
  • Step 304 Input the candidate region feature maps of uniform size into the object classification and bounding box regression fully connected layer to perform candidate region classification detection and bounding box regression, obtaining the category of the candidate object in each candidate region and the two-dimensional bounding box information of each candidate object in the pixel coordinate system corresponding to the target image.
  • the target detection main network in this embodiment is composed of four modules: a multi-scale feature extraction network, a candidate region extraction network, a candidate region feature map pooling layer, and an object classification and bounding box regression fully connected layer.
  • taking autonomous vehicle driving as an example, surrounding vehicles move over a large range in the camera coordinate system while the vehicle is driving, so vehicles at different positions in the camera coordinate system are imaged at very different sizes in the pixel coordinate system.
  • this embodiment uses a multi-scale feature extraction network to extract input image features, exploiting the multi-scale, multi-level pyramid structure inherent in deep convolutional neural networks to extract features of target objects at different scales from a single-size input image, so that the detection system has a degree of scale invariance and can effectively detect objects of different sizes in the image.
  • further, in an optional implementation of this embodiment, the multi-scale feature extraction network is a multi-scale feature extraction network based on ResNet-101, which includes a bottom-up deep semantic feature extraction path and a top-down deep semantic feature fusion path.
  • when performing multi-scale feature extraction on the target image, the target image is input to the bottom-up deep semantic feature extraction path; the semantic features of each layer extracted by the bottom-up path are convolved with a 1×1 convolution kernel and then added, through horizontal connections, to the semantic features of the same layer in the top-down deep semantic feature fusion path, yielding feature maps of different scales.
  • the horizontal connections exploit the positional detail of the low-level semantics, which makes the fused features more refined.
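  • as a rough, non-authoritative sketch of this bottom-up/top-down fusion (the module layout, the 256 output channels, and the use of torchvision's ResNet-101 are illustrative assumptions, not details taken from the patent), the fusion step might look as follows in PyTorch:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torchvision

    class FPNSketch(nn.Module):
        """Minimal FPN-style fusion over the ResNet-101 stages C2-C5."""
        def __init__(self, out_channels=256):
            super().__init__()
            resnet = torchvision.models.resnet101(weights=None)
            # Bottom-up deep semantic feature extraction path: reuse ResNet-101 stages.
            self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
            self.stages = nn.ModuleList([resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4])
            # 1x1 lateral convolutions, one per stage (the "horizontal connections").
            self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in (256, 512, 1024, 2048)])

        def forward(self, x):
            feats = []
            x = self.stem(x)
            for stage in self.stages:
                x = stage(x)
                feats.append(x)
            # Top-down path: upsample each level and add the lateral projection.
            p = self.lateral[3](feats[3])
            pyramid = [p]
            for i in (2, 1, 0):
                p = self.lateral[i](feats[i]) + F.interpolate(p, scale_factor=2, mode="nearest")
                pyramid.insert(0, p)
            return pyramid  # feature maps of different scales (finest first)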
  • in addition, in this embodiment, a candidate region extraction network is used to select candidate regions (that is, regions of interest) from the multi-scale feature maps.
  • the candidate region extraction network is a fully convolutional neural network. For an image feature map of any scale, a window of size n×n slides over the feature map.
  • at each sliding position, anchor boxes of 3 different sizes and 3 different aspect ratios are generated, centered on the window's center point (the anchor).
  • the feature map within each anchor box region of the image feature map is mapped to a 256-dimensional feature vector, and this feature vector is then fed into a classification fully connected layer and a bounding box regression fully connected layer respectively, yielding the position in the input image of the candidate region corresponding to the anchor box and the probability (i.e., confidence) that the region is an object.
  • since a sliding mechanism and anchor boxes of different sizes and aspect ratios are used during candidate region extraction, the candidate region extraction network has both translation invariance and scale invariance with respect to target objects in the input image.
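  • a minimal sketch of the anchor generation described above (the concrete scales and aspect ratios are illustrative assumptions; the text only fixes that there are 3 sizes and 3 ratios per sliding position):

    import numpy as np

    def make_anchors(feat_h, feat_w, stride, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
        """Generate 3 sizes x 3 aspect ratios of anchor boxes centered at every
        sliding position of the feature map, in input-image coordinates."""
        anchors = []
        for y in range(feat_h):
            for x in range(feat_w):
                cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # anchor center
                for s in scales:
                    for r in ratios:
                        w, h = s * np.sqrt(r), s / np.sqrt(r)  # same area, different shape
                        anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
        return np.asarray(anchors)  # (feat_h * feat_w * 9, 4) boxes as [x1, y1, x2, y2]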
  • for a series of candidate regions of arbitrary size in the input image, the corresponding feature maps differ in size and therefore cannot be fed directly into fully connected layers, which require fixed-size input. This embodiment therefore uses the idea of a spatial pyramid pooling layer to design a candidate region feature map pooling layer: the feature map of each candidate region is divided evenly into W×H blocks, and max-pooling each block yields a feature map of uniform size W×H; the pooling space used in the present invention is 7×7, i.e. W = H = 7.
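  • the pooling operation can be sketched as follows (a single-level pooling in the spirit of spatial pyramid pooling; treating the box as integer feature-map coordinates is a simplifying assumption):

    import torch.nn.functional as F

    def roi_pool_7x7(feature_map, box):
        """Max-pool one candidate-region feature map to a fixed 7x7 grid:
        the region is divided into 7x7 blocks and each block is max-pooled,
        so a region of any size maps to the same output size."""
        x1, y1, x2, y2 = box  # candidate region in feature-map coordinates
        region = feature_map[:, :, y1:y2, x1:x2]
        return F.adaptive_max_pool2d(region, output_size=(7, 7))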
  • the object classification and bounding box regression fully connected layer in this embodiment includes two sub-modules, an object classification fully connected layer and an object bounding box regressor.
  • the output feature map of the candidate region feature map pooling layer is mapped through two 1024-dimensional fully connected layers, after which a softmax function classifies the candidate object (for example as pedestrian, bicycle, car, or motorcycle) and the two-dimensional bounding box position of the candidate object in the image is estimated at the same time.
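  • a sketch of this classification and regression head (the two 1024-dimensional layers follow the text; the input size and the class count are illustrative assumptions):

    import torch.nn as nn

    class DetectionHead(nn.Module):
        """Two 1024-d fully connected layers, then a softmax object classifier
        and a per-class 2D bounding-box regressor."""
        def __init__(self, in_features=256 * 7 * 7, num_classes=5):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Flatten(),
                nn.Linear(in_features, 1024), nn.ReLU(inplace=True),
                nn.Linear(1024, 1024), nn.ReLU(inplace=True),
            )
            self.cls = nn.Linear(1024, num_classes)       # pedestrian, bicycle, car, ...
            self.bbox = nn.Linear(1024, num_classes * 4)  # per-class 2D box estimate

        def forward(self, pooled_roi):
            h = self.fc(pooled_roi)
            return self.cls(h).softmax(dim=-1), self.bbox(h)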
  • Step 102 Obtain the feature maps corresponding to the preset category target objects among all the candidate objects, and input them into the first estimation branch network, and control the first estimation branch network to estimate the three-dimensional direction of the target object in the camera coordinate system.
  • specifically, in this embodiment, when the fully connected layer of the target detection main network predicts that the object in a candidate region is a target object of a preset category (for example, a car), the original pooled candidate region feature map is input to the first estimation branch network to estimate the three-dimensional direction of the target object model in the camera coordinate system (the actual driving environment).
  • controlling the first estimation branch network to estimate the three-dimensional direction of the target object in the camera coordinate system includes: controlling the classification and three-dimensional direction estimation branch network to estimate the subcategory of the target object and the three-dimensional direction of the target object in the camera coordinate system.
  • specifically, the feature map of the region corresponding to the target object can be mapped through two 100-dimensional fully connected layers, after which a softmax function performs subcategory detection on the "target object candidate region" while the three-dimensional direction of the target object model in the camera coordinate system (the actual driving environment) is estimated.
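  • the branch can be sketched as below (the two 100-d layers and the softmax subcategory head follow the text; representing the three-dimensional direction as a quaternion follows the loss-function description later in this document, and the subcategory count is an assumption):

    import torch.nn as nn

    class DirectionBranch(nn.Module):
        """First estimation branch: subcategory classification plus 3D direction."""
        def __init__(self, in_features=256 * 7 * 7, num_subcats=10):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Flatten(),
                nn.Linear(in_features, 100), nn.ReLU(inplace=True),
                nn.Linear(100, 100), nn.ReLU(inplace=True),
            )
            self.subcat = nn.Linear(100, num_subcats)  # softmax subcategory head
            self.quat = nn.Linear(100, 4)              # quaternion for the 3D direction

        def forward(self, pooled_roi):
            h = self.fc(pooled_roi)
            q = self.quat(h)
            q = q / q.norm(dim=-1, keepdim=True)  # normalize to a valid rotation
            return self.subcat(h).softmax(dim=-1), q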
  • Step 103 Control the second estimation branch network to estimate the three-dimensional position of the target object in the camera coordinate system based on the two-dimensional bounding box information and feature map corresponding to the target object, and then use the three-dimensional position and the three-dimensional direction to obtain the six-degree-of-freedom attitude information of the target object.
  • specifically, after the first estimation branch network estimates the three-dimensional direction of the target object in the camera coordinate system (the actual driving environment), the second estimation branch network fuses the information provided by the first estimation branch network and calculates the three-dimensional position of each target object in the camera coordinate system, thereby achieving end-to-end six-degree-of-freedom attitude estimation of the target object.
  • in one implementation of this embodiment, when the second estimation branch network estimates the three-dimensional position of the target object in the camera coordinate system based on the two-dimensional bounding box information and feature map corresponding to the target object, the two-dimensional bounding box information may first be converted into bounding box information in the camera coordinate system, and the regional feature map converted into a vector of a specific dimension through the first estimation network branch; the converted information is then input into the second estimation branch network, where the transformed bounding box information and regional feature information are fused in a cascaded manner to output the three-dimensional position, which together with the three-dimensional direction output by the first estimation branch network forms the six-degree-of-freedom attitude information of the target object.
  • since this process is implemented end to end, it greatly increases calculation speed and avoids the error propagation of multi-stage processing, thus guaranteeing the rate and accuracy of target object attitude estimation and, in turn, the timeliness and accuracy of the system's perception of the surrounding environment, which greatly improves the decision-making and control performance of automated control.
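  • one plausible reading of this cascaded fusion is a simple concatenation of the mapped bounding-box feature with the direction branch's feature vector, as in the sketch below (the dimensions and the concatenation itself are our assumptions; the text specifies only that converted bounding-box information and regional feature information are fused to output the three-dimensional position):

    import torch
    import torch.nn as nn

    class PositionBranch(nn.Module):
        """Second estimation branch: regress the 3D position in the camera frame."""
        def __init__(self, box_features=4, branch_features=100):
            super().__init__()
            self.box_fc = nn.Sequential(  # map the 2D bounding-box information
                nn.Linear(box_features, 100), nn.ReLU(inplace=True),
                nn.Linear(100, 100), nn.ReLU(inplace=True),
            )
            self.pos = nn.Linear(100 + branch_features, 3)  # (X, Y, Z)

        def forward(self, box_info, direction_feat):
            fused = torch.cat([self.box_fc(box_info), direction_feat], dim=-1)
            return self.pos(fused)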
  • the six-degree-of-freedom attitude information obtained in this embodiment can also be used to visualize the target, so that the target can be presented to the user more intuitively.
  • the box identified by C in Figure 2 indicates a second estimation branch network provided by this embodiment. Corresponding to the case where the first estimation branch network is a classification and three-dimensional direction estimation branch network that also outputs the subcategory of the target object, using the three-dimensional position and the three-dimensional direction to obtain the six-degree-of-freedom attitude information of the target object includes: using the three-dimensional position, the three-dimensional direction and the subcategory of the target object to obtain the six-degree-of-freedom attitude information of the target object of each subcategory.
  • specifically, the target bounding box position feature from the fully connected layer of the target detection main network is input into two 100-dimensional fully connected layers for mapping using the two-dimensional bounding box information of the target object in the image; information from the classification and three-dimensional direction estimation branch network, such as the subcategory and the three-dimensional direction of the target object, is fused at the same time to improve calculation accuracy, and the three-dimensional position of the target object in the camera coordinate system is calculated.
  • the loss function of the overall convolutional neural network of this embodiment is a weighted sum of the loss of the fully connected layer of the target detection main network and the classification, three-dimensional direction and three-dimensional position losses of the two estimation branch networks; its full form and symbol definitions are given in the description below.
  • the target detection main network is controlled to perform feature extraction on the input target image, and then to detect and output the category of each candidate object in the target image and the two-dimensional bounding box information of each candidate object; the feature map corresponding to the preset-category target object among all candidate objects is obtained and input into the first estimation branch network, which is controlled to estimate the three-dimensional direction of the target object in the camera coordinate system; the second estimation branch network is controlled to estimate the three-dimensional position of the target object in the camera coordinate system based on the two-dimensional bounding box information and feature map of the target object, and the three-dimensional position and three-dimensional direction are then used to obtain the six-degree-of-freedom attitude information of the target object.
  • different network branches respectively estimate the three-dimensional direction and three-dimensional position of the target object, realizing end-to-end six-degree-of-freedom attitude estimation of objects in the environment surrounding the target object and effectively improving both the speed and the accuracy of the computation.
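  • wiring the sketches above together gives a rough picture of the end-to-end flow (a single candidate region, a dummy input, and the earlier illustrative classes are assumed to be in scope):

    import torch

    backbone = FPNSketch()
    head = DetectionHead()
    direction = DirectionBranch()
    position = PositionBranch()

    image = torch.randn(1, 3, 512, 512)              # dummy monocular RGB image
    p2 = backbone(image)[0]                          # finest pyramid level
    roi = roi_pool_7x7(p2, (10, 10, 60, 60))         # one pooled candidate region
    cls_probs, boxes = head(roi)                     # category + 2D bounding box
    subcat, quat = direction(roi)                    # subcategory + 3D direction
    xyz = position(boxes[:, :4], direction.fc(roi))  # 3D position in camera frame
    pose6dof = (quat, xyz)                           # six-degree-of-freedom attitude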
  • this embodiment shows a six-degree-of-freedom attitude estimation device, which is applied to an overall convolutional neural network including a target detection main network, a first estimation branch network, and a second estimation branch network.
  • the six-degree-of-freedom attitude estimation device of this embodiment includes:
  • the detection module 701 is used to input the target image to the target detection main network, control the target detection main network to perform feature extraction on the target image to obtain a feature map, and then detect, based on the feature map, the category of each candidate object in the target image and the two-dimensional bounding box information of each candidate object in the pixel coordinate system corresponding to the target image;
  • the first estimation module 702 is used to obtain the feature map corresponding to the target object of the preset category among all the candidate objects, input it into the first estimation branch network, and control the first estimation branch network to estimate the three-dimensional direction of the target object in the camera coordinate system;
  • the second estimation module 703 is used to control the second estimation branch network to estimate the three-dimensional position of the target object in the camera coordinate system based on the two-dimensional bounding box information and feature map corresponding to the target object, and then use the three-dimensional position and three-dimensional direction to obtain the six-degree-of-freedom attitude information of the target object.
  • specifically, in this embodiment the target detection main network performs feature extraction on the input image, and then detects and outputs the category of each object in the image and its two-dimensional bounding box. When the fully connected layer of the target detection main network predicts that the object in a candidate region is a target object of a preset category (such as a car), the feature map corresponding to the target object is input to the first estimation branch network, and the three-dimensional direction of the target object model in the camera coordinate system (the actual driving environment) is estimated.
  • after the first estimation branch network estimates the three-dimensional direction of the target object in the camera coordinate system (the actual driving environment), the second estimation branch network fuses the information provided by the first estimation branch network and calculates the three-dimensional position of each target object in the camera coordinate system (the actual driving environment), thereby achieving end-to-end six-degree-of-freedom attitude estimation of the target object. Since this process is implemented end to end, it greatly increases calculation speed and avoids the error propagation of multi-stage processing, thus guaranteeing the rate and accuracy of target object attitude estimation and, in turn, the timeliness and accuracy of the system's perception of the surrounding environment, which greatly improves the decision-making and control performance of automated control.
  • in some implementations of this embodiment, the target detection main network includes a multi-scale feature extraction network, a candidate region extraction network, a candidate region feature map pooling layer, and an object classification and bounding box regression fully connected layer; correspondingly, the detection module 701 is specifically configured to input the target image into the target detection main network and use the multi-scale feature extraction network to perform multi-scale feature extraction on the target image to obtain feature maps of different scales; use the candidate region extraction network to extract, from the feature maps of different scales, the feature maps corresponding to preset candidate regions; use the candidate region feature map pooling layer to pool all candidate region feature maps, unifying their sizes; and input the candidate region feature maps of uniform size into the object classification and bounding box regression fully connected layer to perform candidate region classification detection and bounding box regression, obtaining the category of the candidate object in each candidate region and the two-dimensional bounding box information of each candidate object in the pixel coordinate system corresponding to the target image.
  • further, in some implementations of this embodiment, the multi-scale feature extraction network is a multi-scale feature extraction network based on ResNet-101, which includes a bottom-up deep semantic feature extraction path and a top-down deep semantic feature fusion path;
  • correspondingly, when using the multi-scale feature extraction network to perform multi-scale feature extraction on the target image to obtain feature maps of different scales, the detection module 701 is specifically configured to input the target image to the bottom-up deep semantic feature extraction path; the semantic features of each layer extracted by the bottom-up path are convolved with a 1×1 convolution kernel and then added, through horizontal connections, to the semantic features of the same layer in the top-down deep semantic feature fusion path, yielding feature maps of different scales.
  • in some implementations of this embodiment, the first estimation branch network is a classification and three-dimensional direction estimation branch network; correspondingly, the first estimation module 702 is specifically configured to obtain the feature map corresponding to the preset-category target object among all candidate objects, input it into the classification and three-dimensional direction estimation branch network, and control the classification and three-dimensional direction estimation branch network to estimate the subcategory of the target object and the three-dimensional direction of the target object in the camera coordinate system.
  • the second estimation module 703 is specifically configured to control the second estimation branch network to estimate the three-dimensional position of the target object in the camera coordinate system based on the two-dimensional bounding box information and feature map corresponding to the target object, and then use the three-dimensional position, the three-dimensional direction and the subcategory of the target object to obtain the six-degree-of-freedom attitude information of the target object of each subcategory.
  • the six-degree-of-freedom attitude estimation method in the foregoing embodiment can be implemented on the basis of the six-degree-of-freedom attitude estimation device provided by this embodiment; those of ordinary skill in the art will clearly understand that, for convenience and brevity of description, the specific working process of the six-degree-of-freedom attitude estimation device described in this embodiment may refer to the corresponding process in the foregoing method embodiment, which is not repeated here.
  • using the six-degree-of-freedom attitude estimation device provided by this embodiment, the target detection main network is controlled to perform feature extraction on the input target image, and then to detect and output the category of each candidate object in the target image and the two-dimensional bounding box information of each candidate object; the feature map corresponding to the preset-category target object among all candidate objects is obtained and input into the first estimation branch network, which is controlled to estimate the three-dimensional direction of the target object in the camera coordinate system; the second estimation branch network is controlled to estimate the three-dimensional position of the target object in the camera coordinate system based on the two-dimensional bounding box information and feature map of the target object, and the three-dimensional position and three-dimensional direction are then used to obtain the six-degree-of-freedom attitude information of the target object.
  • different network branches respectively estimate the three-dimensional direction and three-dimensional position of the target object, realizing end-to-end six-degree-of-freedom attitude estimation of objects in the environment surrounding the target object and effectively improving both the speed and the accuracy of the computation.
  • This embodiment provides an electronic device. As shown in FIG. 8, it includes a processor 801, a memory 802, and a communication bus 803.
  • the communication bus 803 is used to implement connection and communication between the processor 801 and the memory 802;
  • the processor 801 is configured to execute one or more computer programs stored in the memory 802 to implement at least one step in the six-degree-of-freedom attitude estimation method in the first embodiment.
  • This embodiment also provides a computer-readable storage medium, which includes volatile or non-volatile, removable or non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, computer program modules, or other data).
  • Computer-readable storage media include but are not limited to RAM (Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory or other memory technology, CD-ROM (Compact Disc Read-Only Memory), digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • the computer-readable storage medium in this embodiment may be used to store one or more computer programs, and the stored one or more computer programs may be executed by a processor to implement at least one step of the method in the first embodiment.
  • This embodiment also provides a computer program, which can be distributed on a computer-readable medium and executed by a computing device to implement at least one step of the method in the first embodiment; and in some cases, at least one of the steps shown or described can be performed in an order different from that described in the foregoing embodiment.
  • This embodiment also provides a computer program product, including a computer readable device, and the computer readable device stores the computer program as shown above.
  • the computer-readable device in this embodiment may include the computer-readable storage medium as shown above.
  • communication media usually contain computer-readable instructions, data structures, computer program modules, or other data in a modulated data signal such as a carrier wave or other transmission mechanism, and may include any information delivery medium. Therefore, the present invention is not limited to any specific combination of hardware and software.

Abstract

According to the six-degree-of-freedom attitude estimation method and device and the computer-readable storage medium disclosed in the embodiments of the present invention, a target detection main network is controlled to perform feature extraction on an input image and then detect and output the category and two-dimensional bounding box information of each candidate object in the image; the feature map of a preset-category target object among all candidate objects is input into a first estimation branch network, which estimates the three-dimensional direction of the target object in the camera coordinate system; a second estimation branch network is controlled to estimate the three-dimensional position of the target object in the camera coordinate system based on the two-dimensional bounding box information and feature map of the target object, and the three-dimensional position and three-dimensional direction are then used to obtain the six-degree-of-freedom attitude information of the target object. Through the implementation of the present invention, different network branches respectively estimate the three-dimensional direction and three-dimensional position of the target object, realizing end-to-end six-degree-of-freedom attitude estimation of objects in the environment surrounding the target object and effectively improving both computation speed and accuracy.

Description

Six-degree-of-freedom attitude estimation method and device, and computer-readable storage medium  Technical Field
The present invention relates to the technical field of spatial positioning, and in particular to a six-degree-of-freedom attitude estimation method and device and a computer-readable storage medium.
Background Art
With the rapid development of artificial intelligence, automation technologies such as autonomous vehicle driving and intelligent robot control are receiving more and more attention from industry; among them, perception of the environment surrounding the target control object is the foundation of automated control operations.
Taking autonomous driving as an example, perception of the vehicle's surroundings is the most central technology in an autonomous driving system. It covers object detection and semantic segmentation in images of the surrounding environment, such as sidewalk detection, roadway detection, lane line detection, vehicle detection and pedestrian detection. Multi-degree-of-freedom vehicle pose estimation extends traditional object detection and semantic segmentation into three-dimensional space; its main task is to accurately locate and identify all vehicle objects in a driving video sequence or single-frame image while estimating the multi-degree-of-freedom pose of each detected vehicle in three-dimensional space. At present, multi-degree-of-freedom vehicle pose estimation usually relies on a multi-stage six-degree-of-freedom pose estimation network that combines deep learning with geometric constraints. This method realizes six-degree-of-freedom pose estimation of a car in two steps: a deep neural network first detects the vehicles in an input monocular RGB image and estimates each detected vehicle's length, width, height and three-degree-of-freedom direction, and geometric constraint relationships are then used to compute the vehicle's three-degree-of-freedom position in the three-dimensional space of the actual driving scene.
Although the above deep-learning-based multi-degree-of-freedom pose estimation method can perceive the environment around the target control object and has achieved good results in related scenarios, the model still suffers from a cumbersome training and testing process, the inability to train and test end to end, and slow pose estimation. These defects restrict the application of automation technology in scenarios with high requirements on control accuracy and real-time performance, and thus impose considerable limitations in practical applications.
Summary of the Invention
Technical Problem
The main purpose of the embodiments of the present invention is to provide a six-degree-of-freedom attitude estimation method and device and a computer-readable storage medium, which can at least solve the following problems that arise in the related art when a method combining deep learning and geometric constraints is used to perceive the environment around a target control object: the training and testing process of the model is cumbersome, end-to-end training and testing cannot be realized, and attitude estimation of objects in the surrounding environment is slow.
Solution to Problem
Technical Solution
To achieve the above object, a first aspect of the embodiments of the present invention provides a six-degree-of-freedom attitude estimation method, applied to an overall convolutional neural network including a target detection main network, a first estimation branch network and a second estimation branch network, the method including:
inputting a target image into the target detection main network, and controlling the target detection main network to perform feature extraction on the target image to obtain a feature map and then detect, based on the feature map, the category of each candidate object in the target image and the two-dimensional bounding box information of each candidate object in the pixel coordinate system corresponding to the target image;
obtaining the feature map corresponding to a preset-category target object among all candidate objects, inputting it into the first estimation branch network, and controlling the first estimation branch network to estimate the three-dimensional direction of the target object in a camera coordinate system;
controlling the second estimation branch network to estimate the three-dimensional position of the target object in the camera coordinate system based on the two-dimensional bounding box information and feature map corresponding to the target object, and then using the three-dimensional position and the three-dimensional direction to obtain the six-degree-of-freedom attitude information of the target object.
To achieve the above object, a second aspect of the embodiments of the present invention provides a six-degree-of-freedom attitude estimation device, applied to an overall convolutional neural network including a target detection main network, a first estimation branch network and a second estimation branch network, the device including:
a detection module, configured to input a target image into the target detection main network, and control the target detection main network to perform feature extraction on the target image to obtain a feature map and then detect, based on the feature map, the category of each candidate object in the target image and the two-dimensional bounding box information of each candidate object in the pixel coordinate system corresponding to the target image;
a first estimation module, configured to obtain the feature map corresponding to a preset-category target object among all candidate objects, input it into the first estimation branch network, and control the first estimation branch network to estimate the three-dimensional direction of the target object in a camera coordinate system;
a second estimation module, configured to control the second estimation branch network to estimate the three-dimensional position of the target object in the camera coordinate system based on the two-dimensional bounding box information and feature map corresponding to the target object, and then use the three-dimensional position and the three-dimensional direction to obtain the six-degree-of-freedom attitude information of the target object.
To achieve the above object, a third aspect of the embodiments of the present invention provides an electronic device, including a processor, a memory and a communication bus;
the communication bus is used to implement connection and communication between the processor and the memory;
the processor is configured to execute one or more programs stored in the memory to implement the steps of any of the six-degree-of-freedom attitude estimation methods described above.
To achieve the above object, a fourth aspect of the embodiments of the present invention provides a computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of any of the six-degree-of-freedom attitude estimation methods described above.
Advantageous Effects of Invention
Advantageous Effects
According to the six-degree-of-freedom attitude estimation method and device and the computer-readable storage medium provided by the embodiments of the present invention, the target detection main network is controlled to perform feature extraction on the input target image and then detect and output the category of each candidate object in the target image and the two-dimensional bounding box information of each candidate object; the feature map corresponding to the preset-category target object among all candidate objects is obtained and input into the first estimation branch network, which is controlled to estimate the three-dimensional direction of the target object in the camera coordinate system; the second estimation branch network is controlled to estimate the three-dimensional position of the target object in the camera coordinate system based on the two-dimensional bounding box information and feature map of the target object, and the three-dimensional position and three-dimensional direction are then used to obtain the six-degree-of-freedom attitude information of the target object. Through the implementation of the present invention, different network branches respectively estimate the three-dimensional direction and three-dimensional position of the target object, realizing end-to-end six-degree-of-freedom attitude estimation of objects in the environment surrounding the target object and effectively improving both computation speed and accuracy.
Other features of the present invention and their corresponding effects are set forth in later parts of the specification, and it should be understood that at least some of these effects become apparent from the description herein.
Brief Description of Drawings
Description of Drawings
In order to explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of the basic flow of the six-degree-of-freedom attitude estimation method provided by the first embodiment of the present invention;
FIG. 2 is a schematic diagram of the overall network framework provided by the first embodiment of the present invention;
FIG. 3 is a schematic flowchart of the target detection method provided by the first embodiment of the present invention;
FIG. 4 is a schematic diagram of multi-scale feature extraction provided by the first embodiment of the present invention;
FIG. 5 is a schematic diagram of candidate region extraction provided by the first embodiment of the present invention;
FIG. 6 is a schematic diagram of candidate region feature map pooling provided by the first embodiment of the present invention;
FIG. 7 is a schematic structural diagram of the six-degree-of-freedom attitude estimation device provided by the second embodiment of the present invention;
FIG. 8 is a schematic structural diagram of the electronic device provided by the third embodiment of the present invention.
Embodiments of the Invention
Mode for the Invention
To make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present invention.
First Embodiment:
In order to solve the technical problems in the related art that, when a method combining deep learning and geometric constraints is used to perceive the environment around a target control object, the training and testing process of the model is cumbersome, end-to-end training and testing cannot be realized, and attitude estimation of objects in the surrounding environment is slow, this embodiment proposes a six-degree-of-freedom attitude estimation method applied to an overall convolutional neural network including a target detection main network, a first estimation branch network and a second estimation branch network. FIG. 1 is a schematic diagram of the basic flow of the six-degree-of-freedom attitude estimation method provided by this embodiment, which includes the following steps:
Step 101: Input a target image into the target detection main network, control the target detection main network to perform feature extraction on the target image to obtain a feature map, and then detect, based on the feature map, the category of each candidate object in the target image and the two-dimensional bounding box information of each candidate object in the pixel coordinate system corresponding to the target image.
Specifically, the target detection main network of this embodiment performs feature extraction on the input image and then detects and outputs the categories of the objects in the image and their two-dimensional bounding boxes. It should be noted that the target image in this embodiment may be a monocular RGB image captured by a monocular camera; in addition, a candidate object is an object of interest, whose type can be selected according to the specific application scenario. For example, in an autonomous driving scenario, candidate objects may include pedestrians, vehicles, and so on.
FIG. 2 is a schematic diagram of the overall network framework provided by this embodiment. The box identified by A in FIG. 2 indicates a target detection main network provided by this embodiment. Optionally, the target detection main network includes a multi-scale feature extraction network, a candidate region extraction network, a candidate region feature map pooling layer, and an object classification and bounding box regression fully connected layer. Based on this network architecture, this embodiment provides a target detection method; FIG. 3 is a schematic flowchart of the target detection method provided by this embodiment, which includes the following steps:
Step 301: Use the multi-scale feature extraction network to perform multi-scale feature extraction on the target image to obtain feature maps of different scales;
Step 302: Use the candidate region extraction network to extract, from the feature maps of different scales, the feature maps corresponding to preset candidate regions;
Step 303: Use the candidate region feature map pooling layer to perform a pooling operation on all candidate region feature maps, unifying their sizes;
Step 304: Input the candidate region feature maps of uniform size into the object classification and bounding box regression fully connected layer to perform candidate region classification detection and bounding box regression, obtaining the category of the candidate object in each candidate region and the two-dimensional bounding box information of each candidate object in the pixel coordinate system corresponding to the target image.
Specifically, the target detection main network in this embodiment consists of four modules: a multi-scale feature extraction network, a candidate region extraction network, a candidate region feature map pooling layer, and an object classification and bounding box regression fully connected layer. Taking autonomous vehicle driving as an example, surrounding vehicles move over a large range in the camera coordinate system while the vehicle is driving, so vehicles at different positions in the camera coordinate system are imaged at very different sizes in the pixel coordinate system. This embodiment uses a multi-scale feature extraction network to extract input image features, exploiting the multi-scale, multi-level pyramid structure inherent in deep convolutional neural networks to extract features of target objects at different scales from a single-size input image, so that the detection system has a degree of scale invariance and can effectively detect objects of different sizes in the image.
Further, in an optional implementation of this embodiment, the multi-scale feature extraction network is a multi-scale feature extraction network based on ResNet-101, which includes a bottom-up deep semantic feature extraction path and a top-down deep semantic feature fusion path. Specifically, referring to FIG. 4, when performing multi-scale feature extraction on the target image with the ResNet-101-based multi-scale feature extraction network, the target image is input to the bottom-up deep semantic feature extraction path; the semantic features of each layer extracted by the bottom-up path are convolved with a 1×1 convolution kernel and then added, through horizontal connections, to the semantic features of the same layer in the top-down deep semantic feature fusion path, yielding feature maps of different scales. The horizontal connections exploit the positional detail of the low-level semantics, which makes the fused features more refined.
In addition, in this embodiment, a candidate region extraction network is used to select candidate regions (that is, regions of interest) from the multi-scale feature maps.
As shown in FIG. 5, the candidate region extraction network is a fully convolutional neural network. For an image feature map of any scale, a window of size n×n slides over the feature map; each time it slides, anchor boxes of 3 different sizes and 3 different aspect ratios are generated, anchored at the center point of the window. The feature map within each anchor box region of the image feature map is mapped to a 256-dimensional feature vector, and this feature vector is then fed into a classification fully connected layer and a bounding box regression fully connected layer respectively, yielding the position in the input image of the candidate region corresponding to the anchor box and the probability (i.e., confidence) that the region is or is not an object. Since a sliding mechanism and anchor boxes of different sizes and aspect ratios are used during candidate region extraction, the candidate region extraction network has both translation invariance and scale invariance with respect to target objects in the input image.
It should also be noted that for a series of candidate regions of arbitrary size in the input image, the corresponding feature maps differ in size and therefore cannot be fed directly into fully connected layers, which require fixed-size input, for candidate region classification detection and bounding box regression. This embodiment therefore uses the idea of a spatial pyramid pooling layer to design a candidate region feature map pooling layer. As shown in FIG. 6, for a candidate region of any size output by the candidate region extraction network, its corresponding feature map is first divided evenly into W×H blocks, and a max-pooling operation over each block yields a feature map of uniform size W×H, which is then mapped in the object classification and bounding box regression fully connected layer. The candidate region feature pooling space used in the present invention is 7×7, i.e., W = H = 7.
It should be understood that the object classification and bounding box regression fully connected layer in this embodiment includes two sub-modules, an object classification fully connected layer and an object bounding box regressor. Continuing with FIG. 2, the output feature map of the candidate region feature map pooling layer is mapped through two 1024-dimensional fully connected layers, after which a softmax function classifies the candidate region into candidate object categories such as pedestrian, bicycle, car and motorcycle, and the two-dimensional bounding box position of the candidate object in the image is estimated at the same time.
Step 102: Obtain the feature map corresponding to a preset-category target object among all candidate objects, input it into the first estimation branch network, and control the first estimation branch network to estimate the three-dimensional direction of the target object in the camera coordinate system.
Specifically, in this embodiment, when the fully connected layer of the target detection main network predicts that the object in a candidate region is a target object of a preset category (for example, a car), the original pooled candidate region feature map is input into the first estimation branch network to estimate the three-dimensional direction of the target object model in the camera coordinate system (the actual driving environment).
Referring again to FIG. 2, the box identified by B in FIG. 2 indicates a first estimation branch network provided by this embodiment; optionally, the first estimation branch network is a classification and three-dimensional direction estimation branch network. Correspondingly, controlling the first estimation branch network to estimate the three-dimensional direction of the target object in the camera coordinate system includes: controlling the classification and three-dimensional direction estimation branch network to estimate the subcategory of the target object and the three-dimensional direction of the target object in the camera coordinate system. Specifically, the feature map of the region corresponding to the target object can be mapped through two 100-dimensional fully connected layers, after which a softmax function performs subcategory detection on the "target object candidate region" while the three-dimensional direction of the target object model in the camera coordinate system (the actual driving environment) is estimated.
Step 103: Control the second estimation branch network to estimate the three-dimensional position of the target object in the camera coordinate system based on the two-dimensional bounding box information and feature map corresponding to the target object, and then use the three-dimensional position and the three-dimensional direction to obtain the six-degree-of-freedom attitude information of the target object.
Specifically, after the first estimation branch network estimates the three-dimensional direction of the target object in the camera coordinate system (the actual driving environment), the second estimation branch network fuses the information provided by the first estimation branch network and calculates the three-dimensional position of each target object in the camera coordinate system (the actual driving environment), thereby achieving end-to-end six-degree-of-freedom attitude estimation of the target object. It should be noted that in one implementation of this embodiment, when the second estimation branch network estimates the three-dimensional position of the target object in the camera coordinate system based on the two-dimensional bounding box information and feature map corresponding to the target object, the two-dimensional bounding box information may first be converted into bounding box information in the camera coordinate system, and the regional feature map converted into a vector of a specific dimension through the first estimation network branch; the converted information is then input into the second estimation branch network, where the transformed bounding box information and regional feature information are fused in a cascaded manner to output the three-dimensional position, which together with the three-dimensional direction output by the first estimation branch network forms the six-degree-of-freedom attitude information of the target object. Since this process is implemented end to end, it greatly increases calculation speed and avoids the error propagation of multi-stage processing, thus guaranteeing the rate and accuracy of target object attitude estimation and, in turn, the timeliness and accuracy of the system's perception of the surrounding environment, which greatly improves the decision-making and control performance of automated control. It should also be understood that the six-degree-of-freedom attitude information obtained in this embodiment can be used to visualize the target, presenting it to the user more intuitively.
Referring again to FIG. 2, the box identified by C in FIG. 2 indicates a second estimation branch network provided by this embodiment. Corresponding to the case where the first estimation branch network is a classification and three-dimensional direction estimation branch network that also outputs the subcategory of the target object, using the three-dimensional position and the three-dimensional direction to obtain the six-degree-of-freedom attitude information of the target object includes: using the three-dimensional position, the three-dimensional direction and the subcategory of the target object to obtain the six-degree-of-freedom attitude information of the target object of each subcategory. Specifically, the target bounding box position feature from the fully connected layer of the target detection main network is input into two 100-dimensional fully connected layers for mapping using the two-dimensional bounding box information of the target object in the image; information from the classification and three-dimensional direction estimation branch network, such as the subcategory and the three-dimensional direction of the target object, is fused at the same time to improve calculation accuracy, and the three-dimensional position of the target object in the camera coordinate system is calculated.
Further optionally, based on the network framework of FIG. 2 and in order to minimize error, the loss function of the overall convolutional neural network of this embodiment is:
$L = L_{det} + L_{est}$, with $L_{est} = \lambda_{cls} L_{cls} + \lambda_{dir} L_{dir} + \lambda_{pos} L_{pos}$, $L_{dir} = \|\hat{q} - q\|^2$ and $L_{pos} = \|\hat{t} - t\|^2$; where:
$L_{det}$ is the loss function of the fully connected layer of the target detection main network;
$L_{est}$ is the loss function of the fully connected layers of the first estimation branch network and the second estimation branch network;
$L_{cls}$ is the classification estimation loss function in the first estimation branch network;
$L_{dir}$ is the three-dimensional direction estimation loss function in the first estimation branch network;
$\hat{q}$ is the estimated quaternion of the three-dimensional direction of the target object in the camera coordinate system;
$q$ is the ground-truth quaternion of the three-dimensional direction of the target object in the camera coordinate system;
$L_{pos}$ is the three-dimensional position estimation loss function of the second estimation branch network;
$\hat{t}$ is the estimated coordinate of the three-dimensional position of the target object in the camera coordinate system;
$t$ is the ground-truth coordinate of the three-dimensional position of the target object in the camera coordinate system;
$\lambda_{cls}$, $\lambda_{dir}$ and $\lambda_{pos}$ are the weight hyperparameters corresponding to each loss function.
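As a sketch of how such a composite loss might be computed during training (the squared-error form of the direction and position terms and the use of cross-entropy for the classification term are illustrative assumptions; only the weighted-sum structure and the quaternion/coordinate targets come from the definitions above):

    import torch.nn.functional as F

    def branch_loss(subcat_logits, subcat_target, q_pred, q_true, t_pred, t_true,
                    w_cls=1.0, w_dir=1.0, w_pos=1.0):
        """Weighted sum of the subcategory classification, 3D direction (quaternion)
        and 3D position losses; the weights play the role of the lambda
        hyperparameters above (values here are placeholders)."""
        l_cls = F.cross_entropy(subcat_logits, subcat_target)
        l_dir = F.mse_loss(q_pred, q_true)  # estimated vs. ground-truth quaternion
        l_pos = F.mse_loss(t_pred, t_true)  # estimated vs. ground-truth coordinates
        return w_cls * l_cls + w_dir * l_dir + w_pos * l_pos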
According to the six-degree-of-freedom attitude estimation method provided by this embodiment of the present invention, the target detection main network is controlled to perform feature extraction on the input target image and then detect and output the category of each candidate object in the target image and the two-dimensional bounding box information of each candidate object; the feature map corresponding to the preset-category target object among all candidate objects is obtained and input into the first estimation branch network, which is controlled to estimate the three-dimensional direction of the target object in the camera coordinate system; the second estimation branch network is controlled to estimate the three-dimensional position of the target object in the camera coordinate system based on the two-dimensional bounding box information and feature map of the target object, and the three-dimensional position and three-dimensional direction are then used to obtain the six-degree-of-freedom attitude information of the target object. Through the implementation of the present invention, different network branches respectively estimate the three-dimensional direction and three-dimensional position of the target object, realizing end-to-end six-degree-of-freedom attitude estimation of objects in the environment surrounding the target object and effectively improving both computation speed and accuracy.
Second Embodiment:
In order to solve the technical problems in the related art that, when a method combining deep learning and geometric constraints is used to perceive the environment around a target control object, the training and testing process of the model is cumbersome, end-to-end training and testing cannot be realized, and attitude estimation of objects in the surrounding environment is slow, this embodiment shows a six-degree-of-freedom attitude estimation device applied to an overall convolutional neural network including a target detection main network, a first estimation branch network and a second estimation branch network. Referring to FIG. 7, the six-degree-of-freedom attitude estimation device of this embodiment includes:
a detection module 701, configured to input a target image into the target detection main network, control the target detection main network to perform feature extraction on the target image to obtain a feature map, and then detect, based on the feature map, the category of each candidate object in the target image and the two-dimensional bounding box information of each candidate object in the pixel coordinate system corresponding to the target image;
a first estimation module 702, configured to obtain the feature map corresponding to a preset-category target object among all candidate objects, input it into the first estimation branch network, and control the first estimation branch network to estimate the three-dimensional direction of the target object in the camera coordinate system;
a second estimation module 703, configured to control the second estimation branch network to estimate the three-dimensional position of the target object in the camera coordinate system based on the two-dimensional bounding box information and feature map corresponding to the target object, and then use the three-dimensional position and the three-dimensional direction to obtain the six-degree-of-freedom attitude information of the target object.
Specifically, in this embodiment the target detection main network performs feature extraction on the input image and then detects and outputs the categories of the objects in the image and their two-dimensional bounding boxes. When the fully connected layer of the target detection main network predicts that the object in a candidate region is a target object of a preset category (for example, a car), the feature map corresponding to the target object is input into the first estimation branch network, and the three-dimensional direction of the target object model in the camera coordinate system (the actual driving environment) is estimated. After the first estimation branch network estimates the three-dimensional direction of the target object in the camera coordinate system (the actual driving environment), the second estimation branch network fuses the information provided by the first estimation branch network and calculates the three-dimensional position of each target object in the camera coordinate system (the actual driving environment), thereby achieving end-to-end six-degree-of-freedom attitude estimation of the target object. Since this process is implemented end to end, it greatly increases calculation speed and avoids the error propagation of multi-stage processing, thus guaranteeing the rate and accuracy of target object attitude estimation and, in turn, the timeliness and accuracy of the system's perception of the surrounding environment, which greatly improves the decision-making and control performance of automated control.
In some implementations of this embodiment, the target detection main network includes a multi-scale feature extraction network, a candidate region extraction network, a candidate region feature map pooling layer, and an object classification and bounding box regression fully connected layer. Correspondingly, the detection module 701 is specifically configured to input the target image into the target detection main network and use the multi-scale feature extraction network to perform multi-scale feature extraction on the target image to obtain feature maps of different scales; use the candidate region extraction network to extract, from the feature maps of different scales, the feature maps corresponding to preset candidate regions; use the candidate region feature map pooling layer to pool all candidate region feature maps, unifying their sizes; and input the candidate region feature maps of uniform size into the object classification and bounding box regression fully connected layer to perform candidate region classification detection and bounding box regression, obtaining the category of the candidate object in each candidate region and the two-dimensional bounding box information of each candidate object in the pixel coordinate system corresponding to the target image.
Further, in some implementations of this embodiment, the multi-scale feature extraction network is a multi-scale feature extraction network based on ResNet-101, which includes a bottom-up deep semantic feature extraction path and a top-down deep semantic feature fusion path. Correspondingly, when using the multi-scale feature extraction network to perform multi-scale feature extraction on the target image to obtain feature maps of different scales, the detection module 701 is specifically configured to input the target image to the bottom-up deep semantic feature extraction path; the semantic features of each layer extracted by the bottom-up path are convolved with a 1×1 convolution kernel and then added, through horizontal connections, to the semantic features of the same layer in the top-down deep semantic feature fusion path, yielding feature maps of different scales.
In some implementations of this embodiment, the first estimation branch network is a classification and three-dimensional direction estimation branch network. Correspondingly, the first estimation module 702 is specifically configured to obtain the feature map corresponding to the preset-category target object among all candidate objects, input it into the classification and three-dimensional direction estimation branch network, and control the classification and three-dimensional direction estimation branch network to estimate the subcategory of the target object and the three-dimensional direction of the target object in the camera coordinate system. The second estimation module 703 is specifically configured to control the second estimation branch network to estimate the three-dimensional position of the target object in the camera coordinate system based on the two-dimensional bounding box information and feature map corresponding to the target object, and then use the three-dimensional position, the three-dimensional direction and the subcategory of the target object to obtain the six-degree-of-freedom attitude information of the target object of each subcategory.
It should be noted that the six-degree-of-freedom attitude estimation methods in the foregoing embodiments can all be implemented on the basis of the six-degree-of-freedom attitude estimation device provided by this embodiment. Those of ordinary skill in the art will clearly understand that, for convenience and brevity of description, the specific working process of the six-degree-of-freedom attitude estimation device described in this embodiment may refer to the corresponding process in the foregoing method embodiment and is not repeated here.
Using the six-degree-of-freedom attitude estimation device provided by this embodiment, the target detection main network is controlled to perform feature extraction on the input target image and then detect and output the category of each candidate object in the target image and the two-dimensional bounding box information of each candidate object; the feature map corresponding to the preset-category target object among all candidate objects is obtained and input into the first estimation branch network, which is controlled to estimate the three-dimensional direction of the target object in the camera coordinate system; the second estimation branch network is controlled to estimate the three-dimensional position of the target object in the camera coordinate system based on the two-dimensional bounding box information and feature map of the target object, and the three-dimensional position and three-dimensional direction are then used to obtain the six-degree-of-freedom attitude information of the target object. Through the implementation of the present invention, different network branches respectively estimate the three-dimensional direction and three-dimensional position of the target object, realizing end-to-end six-degree-of-freedom attitude estimation of objects in the environment surrounding the target object and effectively improving both computation speed and accuracy.
Third Embodiment:
This embodiment provides an electronic device. As shown in FIG. 8, it includes a processor 801, a memory 802 and a communication bus 803, where the communication bus 803 is used to implement connection and communication between the processor 801 and the memory 802, and the processor 801 is configured to execute one or more computer programs stored in the memory 802 to implement at least one step of the six-degree-of-freedom attitude estimation method in the first embodiment.
This embodiment also provides a computer-readable storage medium, which includes volatile or non-volatile, removable or non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, computer program modules or other data). Computer-readable storage media include but are not limited to RAM (Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory or other memory technology, CD-ROM (Compact Disc Read-Only Memory), digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
The computer-readable storage medium in this embodiment can be used to store one or more computer programs, and the stored one or more computer programs can be executed by a processor to implement at least one step of the method in the first embodiment.
This embodiment also provides a computer program, which can be distributed on a computer-readable medium and executed by a computing device to implement at least one step of the method in the first embodiment; in some cases, at least one of the steps shown or described can be performed in an order different from that described in the foregoing embodiment.
This embodiment also provides a computer program product, including a computer-readable device on which the computer program shown above is stored. The computer-readable device in this embodiment may include the computer-readable storage medium shown above.
It can thus be seen that those skilled in the art should understand that all or some of the steps in the methods disclosed above, and the functional modules/units in the systems and devices, can be implemented as software (realizable as computer program code executable by a computing device), firmware, hardware, and appropriate combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed jointly by several physical components. Some or all of the physical components may be implemented as software executed by a processor such as a central processing unit, digital signal processor or microprocessor, or as hardware, or as an integrated circuit such as an application-specific integrated circuit.
In addition, as is well known to those of ordinary skill in the art, communication media usually contain computer-readable instructions, data structures, computer program modules or other data in a modulated data signal such as a carrier wave or other transmission mechanism, and may include any information delivery medium. Therefore, the present invention is not limited to any specific combination of hardware and software.
The above content is a further detailed description of the embodiments of the present invention in combination with specific implementations, and the specific implementation of the present invention should not be deemed limited to these descriptions. For those of ordinary skill in the art to which the present invention belongs, several simple deductions or substitutions can be made without departing from the concept of the present invention, and all of these should be regarded as falling within the protection scope of the present invention.

Claims (10)

  1. A six-degree-of-freedom attitude estimation method, applied to an overall convolutional neural network including a target detection main network, a first estimation branch network and a second estimation branch network, characterized by comprising:
    inputting a target image into the target detection main network, and controlling the target detection main network to perform feature extraction on the target image to obtain a feature map and then detect, based on the feature map, the category of each candidate object in the target image and the two-dimensional bounding box information of each candidate object in the pixel coordinate system corresponding to the target image;
    obtaining the feature map corresponding to a preset-category target object among all candidate objects, inputting it into the first estimation branch network, and controlling the first estimation branch network to estimate the three-dimensional direction of the target object in a camera coordinate system;
    controlling the second estimation branch network to estimate the three-dimensional position of the target object in the camera coordinate system based on the two-dimensional bounding box information and feature map corresponding to the target object, and then using the three-dimensional position and the three-dimensional direction to obtain the six-degree-of-freedom attitude information of the target object.
  2. The six-degree-of-freedom attitude estimation method according to claim 1, characterized in that the target detection main network includes a multi-scale feature extraction network, a candidate region extraction network, a candidate region feature map pooling layer, and an object classification and bounding box regression fully connected layer;
    controlling the target detection main network to perform feature extraction on the target image to obtain a feature map and then detect, based on the feature map, the category of each candidate object in the target image and the two-dimensional bounding box information of each candidate object in the pixel coordinate system corresponding to the target image comprises:
    using the multi-scale feature extraction network to perform multi-scale feature extraction on the target image to obtain feature maps of different scales;
    using the candidate region extraction network to extract, from the feature maps of different scales, the feature maps corresponding to preset candidate regions;
    using the candidate region feature map pooling layer to perform a pooling operation on all candidate region feature maps, unifying their sizes;
    inputting the candidate region feature maps of uniform size into the object classification and bounding box regression fully connected layer to perform candidate region classification detection and bounding box regression, obtaining the category of the candidate object in each candidate region and the two-dimensional bounding box information of each candidate object in the pixel coordinate system corresponding to the target image.
  3. The six-degree-of-freedom attitude estimation method according to claim 2, characterized in that the multi-scale feature extraction network is a multi-scale feature extraction network based on ResNet-101, which includes a bottom-up deep semantic feature extraction path and a top-down deep semantic feature fusion path;
    using the multi-scale feature extraction network to perform multi-scale feature extraction on the target image to obtain feature maps of different scales comprises:
    inputting the target image into the bottom-up deep semantic feature extraction path, convolving the semantic features of each layer extracted by the bottom-up path with a 1×1 convolution kernel, and adding them, through horizontal connections, to the semantic features of the same layer in the top-down deep semantic feature fusion path, to obtain feature maps of different scales.
  4. The six-degree-of-freedom attitude estimation method according to claim 1, characterized in that the first estimation branch network is a classification and three-dimensional direction estimation branch network;
    controlling the first estimation branch network to estimate the three-dimensional direction of the target object in the camera coordinate system comprises:
    controlling the classification and three-dimensional direction estimation branch network to estimate the subcategory of the target object and the three-dimensional direction of the target object in the camera coordinate system;
    using the three-dimensional position and the three-dimensional direction to obtain the six-degree-of-freedom attitude information of the target object comprises:
    using the three-dimensional position, the three-dimensional direction and the subcategory of the target object to obtain the six-degree-of-freedom attitude information of the target object of each subcategory.
  5. The six-degree-of-freedom attitude estimation method according to claim 4, characterized in that the loss function of the overall convolutional neural network is:
    $L = L_{det} + L_{est}$, with $L_{est} = \lambda_{cls} L_{cls} + \lambda_{dir} L_{dir} + \lambda_{pos} L_{pos}$, $L_{dir} = \|\hat{q} - q\|^2$ and $L_{pos} = \|\hat{t} - t\|^2$; where:
    $L_{det}$ is the loss function of the fully connected layer of the target detection main network;
    $L_{est}$ is the loss function of the fully connected layers of the first estimation branch network and the second estimation branch network;
    $L_{cls}$ is the classification estimation loss function in the first estimation branch network;
    $L_{dir}$ is the three-dimensional direction estimation loss function in the first estimation branch network;
    $\hat{q}$ is the estimated quaternion of the three-dimensional direction of the target object in the camera coordinate system;
    $q$ is the ground-truth quaternion of the three-dimensional direction of the target object in the camera coordinate system;
    $L_{pos}$ is the three-dimensional position estimation loss function of the second estimation branch network;
    $\hat{t}$ is the estimated coordinate of the three-dimensional position of the target object in the camera coordinate system;
    $t$ is the ground-truth coordinate of the three-dimensional position of the target object in the camera coordinate system;
    $\lambda_{cls}$, $\lambda_{dir}$ and $\lambda_{pos}$ are the weight hyperparameters corresponding to each loss function.
  6. A six-degree-of-freedom attitude estimation device, applied to an overall convolutional neural network including a target detection main network, a first estimation branch network and a second estimation branch network, characterized by comprising:
    a detection module, configured to input a target image into the target detection main network, and control the target detection main network to perform feature extraction on the target image to obtain a feature map and then detect, based on the feature map, the category of each candidate object in the target image and the two-dimensional bounding box information of each candidate object in the pixel coordinate system corresponding to the target image;
    a first estimation module, configured to obtain the feature map corresponding to a preset-category target object among all candidate objects, input it into the first estimation branch network, and control the first estimation branch network to estimate the three-dimensional direction of the target object in a camera coordinate system;
    a second estimation module, configured to control the second estimation branch network to estimate the three-dimensional position of the target object in the camera coordinate system based on the two-dimensional bounding box information and feature map corresponding to the target object, and then use the three-dimensional position and the three-dimensional direction to obtain the six-degree-of-freedom attitude information of the target object.
  7. The six-degree-of-freedom attitude estimation device according to claim 6, characterized in that the target detection main network includes a multi-scale feature extraction network, a candidate region extraction network, a candidate region feature map pooling layer, and an object classification and bounding box regression fully connected layer;
    the detection module is specifically configured to input a target image into the target detection main network, and use the multi-scale feature extraction network to perform multi-scale feature extraction on the target image to obtain feature maps of different scales; use the candidate region extraction network to extract, from the feature maps of different scales, the feature maps corresponding to preset candidate regions; use the candidate region feature map pooling layer to perform a pooling operation on all candidate region feature maps, unifying their sizes; and input the candidate region feature maps of uniform size into the object classification and bounding box regression fully connected layer to perform candidate region classification detection and bounding box regression, obtaining the category of the candidate object in each candidate region and the two-dimensional bounding box information of each candidate object in the pixel coordinate system corresponding to the target image.
  8. The six-degree-of-freedom attitude estimation device according to claim 6, characterized in that the first estimation branch network is a classification and three-dimensional direction estimation branch network;
    the first estimation module is specifically configured to obtain the feature map corresponding to a preset-category target object among all candidate objects, input it into the classification and three-dimensional direction estimation branch network, and control the classification and three-dimensional direction estimation branch network to estimate the subcategory of the target object and the three-dimensional direction of the target object in the camera coordinate system;
    the second estimation module is specifically configured to control the second estimation branch network to estimate the three-dimensional position of the target object in the camera coordinate system based on the two-dimensional bounding box information and feature map corresponding to the target object, and then use the three-dimensional position, the three-dimensional direction and the subcategory of the target object to obtain the six-degree-of-freedom attitude information of the target object of each subcategory.
  9. An electronic device, characterized by comprising: a processor, a memory and a communication bus;
    the communication bus is used to implement connection and communication between the processor and the memory;
    the processor is configured to execute one or more programs stored in the memory to implement the steps of the six-degree-of-freedom attitude estimation method according to any one of claims 1 to 5.
  10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more programs, and the one or more programs are executable by one or more processors to implement the steps of the six-degree-of-freedom attitude estimation method according to any one of claims 1 to 5.
PCT/CN2019/086883 2019-05-14 2019-05-14 Six-degree-of-freedom attitude estimation method and device, and computer-readable storage medium WO2020227933A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/086883 WO2020227933A1 (zh) 2019-05-14 2019-05-14 Six-degree-of-freedom attitude estimation method and device, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/086883 WO2020227933A1 (zh) 2019-05-14 2019-05-14 Six-degree-of-freedom attitude estimation method and device, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2020227933A1 (zh)

Family

ID=73289967

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/086883 WO2020227933A1 (zh) Six-degree-of-freedom attitude estimation method and device, and computer-readable storage medium

Country Status (1)

Country Link
WO (1) WO2020227933A1 (zh)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8311954B2 (en) * 2007-11-29 2012-11-13 Nec Laboratories America, Inc. Recovery of 3D human pose by jointly learning metrics and mixtures of experts
CN104463108A (zh) * 2014-11-21 2015-03-25 山东大学 Monocular real-time target recognition and pose measurement method
CN105809689A (zh) * 2016-03-09 2016-07-27 哈尔滨工程大学 Machine-vision-based six-degree-of-freedom hull measurement method

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116528062A (zh) * 2023-07-05 2023-08-01 合肥中科类脑智能技术有限公司 Multi-target tracking method
CN116528062B (zh) * 2023-07-05 2023-09-15 合肥中科类脑智能技术有限公司 Multi-target tracking method

Similar Documents

Publication Publication Date Title
CN110119148B Six-degree-of-freedom attitude estimation method and device, and computer-readable storage medium
US11361196B2 (en) Object height estimation from monocular images
US10395377B2 (en) Systems and methods for non-obstacle area detection
CN111161349B Object attitude estimation method, apparatus and device
CN111079619B Method and apparatus for detecting a target object in an image
KR102472767B1 Method and apparatus for calculating a depth map based on reliability
CN110853085B Semantic SLAM-based mapping method and apparatus, and electronic device
EP4211651A1 (en) Efficient three-dimensional object detection from point clouds
CN114519853A Three-dimensional object detection method and system based on multi-modal fusion
WO2020227933A1 (zh) Six-degree-of-freedom attitude estimation method and device, and computer-readable storage medium
CN114972492A Pose determination method and device based on bird's-eye view, and computer storage medium
CN113111787A Target detection method, apparatus, device and storage medium
CN114648639B Target vehicle detection method, system and device
Lai et al. 3D semantic map construction system based on visual SLAM and CNNs
Muresan et al. Stereo and mono depth estimation fusion for an improved and fault tolerant 3D reconstruction
CN115035492A Vehicle identification method, apparatus, device and storage medium
CN114997264A Training data generation method, model training and detection methods, apparatus, and electronic device
CN114022630A Three-dimensional scene reconstruction method, apparatus, device and computer-readable storage medium
CN114140660A Vehicle detection method, apparatus, device and medium
Zhang et al. A Vision-Centric Approach for Static Map Element Annotation
Tamayo et al. Improving Object Distance Estimation in Automated Driving Systems Using Camera Images, LiDAR Point Clouds and Hierarchical Clustering
Gao et al. Real-time 3D object detection using improved convolutional neural network based on image-driven point cloud
US20220270371A1 (en) Determining Distance of Objects
CN109325962B Information processing method, apparatus, device and computer-readable storage medium
CN116612147A Multi-target tracking method and system in panoramic imaging

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19928593

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19928593

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 22.03.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19928593

Country of ref document: EP

Kind code of ref document: A1