CN116524340A - AUV near-end docking monocular pose estimation method and device based on dense point reconstruction - Google Patents

AUV near-end docking monocular pose estimation method and device based on dense point reconstruction

Info

Publication number
CN116524340A
CN116524340A (application number CN202310364180.0A)
Authority
CN
China
Prior art keywords
pose
target object
image
module
normal vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310364180.0A
Other languages
Chinese (zh)
Inventor
徐元欣
刘诚
陈首旭
单文才
马天珩
王鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202310364180.0A priority Critical patent/CN116524340A/en
Publication of CN116524340A publication Critical patent/CN116524340A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/05Underwater scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/24Aligning, centring, orientation detection or correction of the image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/778Active pattern-learning, e.g. online learning of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an AUV near-end docking monocular pose estimation method and device based on dense point reconstruction, comprising the following steps: acquiring an underwater image through a built-in monocular camera of the AUV; performing image preprocessing on the underwater image; and inputting the preprocessed image into a near-end pose estimation deep learning model for pose estimation. The deep learning model comprises a target object detection module for detecting the target object in the preprocessed image and generating a target object image block; a docking station internal target object dense point reconstruction module that adopts an encoding-decoding network to extract features from the target object image block and reconstruct dense point coordinates of the target object; a normal vector supervision module for recovering a surface normal vector diagram of the object from the target object image block; and an Edge-PnP pose estimation module that learns pose features from the dense point coordinates of the target object and the surface normal vector diagram of the object by using a graph convolutional network and estimates the final 6D pose. The method solves the problem of traditional methods in which key points are lost and the pose cannot be solved.

Description

AUV near-end docking monocular pose estimation method and device based on dense point reconstruction
Technical Field
The application relates to the technical field of AUV recycling and docking, in particular to an AUV near-end docking monocular pose estimation method and device based on dense point reconstruction.
Background
An AUV (Autonomous Underwater Vehicle) is an autonomous underwater vehicle; over the last decades, with the continued development of technology, AUVs have become a key tool for underwater surveying, exploration and monitoring. Underwater supporting facilities such as fixed underwater docking stations and large underwater mobile docking stations provide more comprehensive support for the AUV. Docking with a fixed underwater docking station can provide basic services for the AUV, such as charging, data transfer and task instructions. The AUV may dock periodically with a fixed underwater docking station to ensure adequate and safe energy and data when performing long-duration or data-intensive acquisition tasks. By docking with a large underwater mobile docking station, the AUV can be deployed and recovered more quickly and safely, improving the efficiency and success rate of tasks.
The current underwater AUV vision guidance method generally adopts a passive optical guidance scheme based on a lamp array: a light-emitting array formed by underwater guide lamps serves as the underwater target, the pixel positions of the guide lamps in the image are extracted by a vision method, and the relative pose between the AUV and the lamp array is calculated by combining the prior 3D coordinate information of the guide lamps. However, the existing lamp-array guidance scheme has the following defects: 1. interference from underwater floating matter and water-body scattering easily cause errors when the vision method extracts the guide-light centers; 2. the guide lamps have few distinctive features of their own, can only be used in dark deep-sea environments, and their center key points are easily misidentified when other light sources interfere; 3. once some of the guide lamps in the array leave the camera's field of view, the key points are reduced or even lost, so the relative pose cannot be solved.
Disclosure of Invention
In view of this, the present application provides an AUV near-end docking monocular pose estimation method and apparatus based on dense point reconstruction.
According to a first aspect of an embodiment of the present invention, there is provided an AUV near-end docking monocular pose estimation method based on dense point reconstruction, including:
s11: acquiring an underwater image containing a docking station target object through a built-in monocular camera of the AUV;
s12: performing image preprocessing on the underwater image;
s13: inputting the preprocessed image into a near-end pose estimation deep learning model for pose estimation, wherein the near-end pose estimation deep learning model comprises a target object detection module, a docking station internal target object dense point reconstruction module, a normal vector supervision module and a graph convolution-based Edge-PnP pose estimation module; the target object detection module is used for detecting a target object in the preprocessed image and generating a target object image block; the docking station internal target object dense point reconstruction module is used for extracting features from the target object image block by adopting an encoding-decoding network and reconstructing dense point coordinates of the target object; the normal vector supervision module is used for recovering a surface normal vector diagram of the object from the target object image block; and the Edge-PnP pose estimation module learns pose features from the dense point coordinates of the target object and the surface normal vector diagram of the object by using a graph convolutional network and estimates the final 6D pose.
Optionally, performing image preprocessing on the underwater image includes:
and carrying out distortion correction on the underwater image to enable the image to show a correct shape, and then carrying out denoising treatment to obtain a clear underwater image.
Optionally, the target object detection module is configured to detect a target object in the preprocessed image, and generate a target object image block, including:
and detecting the preprocessed underwater image by using a target detector YOLO, and cutting out a target object image block with a fixed size.
Optionally, the docking station internal target object dense point reconstruction module is configured to extract features from the target object image block using an encoding-decoding network and reconstruct dense point coordinates of the target object, including:
and extracting features from the target object image block by using a coding network, decoding the extracted features, and supplementing and generating dense point coordinates by using multi-scale intermediate internal space features generated in the decoding and reconstruction process.
Optionally, the normal vector supervision module is configured to recover a surface normal vector diagram of the object from the target object image block, and includes:
and extracting features from the target object image block by using the coding network, recovering the normal vector of the object surface by using the extracted features, and recovering the mask of the target image block in the recovery process.
Optionally, the Edge-PnP pose estimation module learns pose features from dense point coordinates of the target object and a surface normal vector diagram of the object by using a graph convolutional network and estimates a final 6D pose, including:
for the reconstructed dense 3D point coordinates of the target object and the normal vector diagram of the object surface, combining the 2D pixel point coordinates and the multi-scale intermediate internal spatial features of claim 5, and inputting them together into the Edge-PnP pose estimation module to extract features;
extracting point-by-point features by using point-edge convolution, and finally mapping to a 6D output by using a fully connected layer, i.e., estimating the 6D pose information;
wherein the 6D pose information is the position and orientation of the target relative to the monocular camera, represented as a rotation matrix and a translation vector; it is converted into the pose result of the target relative to the AUV through the fixed transformation between the camera and the AUV body coordinate system.
According to a first aspect of an embodiment of the present invention, there is provided an AUV near-end docking monocular pose estimation apparatus based on dense point reconstruction, including:
the acquisition unit is used for acquiring an underwater image containing a docking station target object through a built-in monocular camera of the AUV;
the preprocessing unit is used for preprocessing the underwater image;
the pose estimation unit is used for inputting the preprocessed image into a near-end docking pose estimation deep learning model to perform pose estimation, wherein the near-end docking pose estimation deep learning model comprises a target object detection module, a docking station internal target object dense point reconstruction module, a normal vector supervision module and a graph convolution-based Edge-PnP pose estimation module; the target object detection module is used for detecting a target object in the preprocessed image and generating a target object image block; the docking station internal target object dense point reconstruction module is used for extracting features from the target object image block by adopting an encoding-decoding network and reconstructing dense point coordinates of the target object; the normal vector supervision module is used for recovering a surface normal vector diagram of the object from the target object image block; and the Edge-PnP pose estimation module learns pose features from the dense point coordinates of the target object and the surface normal vector diagram of the object by using a graph convolutional network and estimates the final 6D pose.
According to a third aspect of an embodiment of the present invention, there is provided an electronic apparatus including:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of the first aspect.
According to a fourth aspect of embodiments of the present invention there is provided a computer readable storage medium having stored thereon computer instructions which when executed by a processor perform the steps of the method according to the first aspect.
The technical scheme provided by the embodiment of the application can comprise the following beneficial effects:
the preprocessed image is input into a near-end deep learning model for estimating the pose, so that the pose estimation is realized, the end-to-end estimation is realized, the estimation efficiency is greatly improved, and the timeliness is greatly enhanced; by constructing a dense point reconstruction module of an object in the docking station, dense texture information of an input image is fully utilized, and the problems that key points are lost and cannot be resolved in the traditional method are solved; the Edge-PnP pose estimation module based on graph convolution fully utilizes the point-by-point characteristics of dense points, the surface normal vector graph and the middle internal space characteristic to return to the 6D pose, so that the information is sufficient and the precision is high.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
Fig. 1 is a flow chart illustrating a method of estimating the pose of an AUV near-end docking based on dense point reconstruction, according to an exemplary embodiment.
FIG. 2 is a schematic diagram of a deep learning model of a near-end docking pose estimation, according to an exemplary embodiment.
FIG. 3 is a schematic diagram illustrating a dense point reconstruction module for an internal target of a docking station, according to an example embodiment.
FIG. 4 is a schematic diagram illustrating an Edge-PnP pose estimation module according to an exemplary embodiment.
FIG. 5 is a schematic diagram of a point-edge convolution block according to an example embodiment.
Fig. 6 is a block diagram illustrating an AUV near-end docking pose estimation apparatus based on dense point reconstruction, according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first message may also be referred to as a second message, and similarly, a second message may also be referred to as a first message, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "when", "upon" or "in response to determining", depending on the context.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for estimating an AUV near-end docking pose based on dense point reconstruction according to an exemplary embodiment, specifically, the method may include the following steps:
step S11: and acquiring an underwater image containing the docking station target object through a built-in monocular camera of the AUV.
The sample image may be a real image or a synthetic image. In some application scenarios, the sample set may include both real images and synthetic images. When the sample image is a real image, the sample image of the underwater docking station target object whose pose is to be estimated is obtained by having the AUV photograph that target object.
Step S12: and carrying out image preprocessing on the underwater image.
The method mainly comprises two parts of image distortion correction and image denoising.
Image distortion correction aims to eliminate the geometric distortion introduced into the digital image by the camera lens or other factors and restore the image to a normal state. Image distortion can be broadly divided into radial distortion and tangential distortion. Radial distortion is distributed along the lens radius because light rays bend more strongly far from the lens center than near it; it can be described by the first few terms of a Taylor series expansion around the principal point:
u_d = u · (1 + k_1·r² + k_2·r⁴ + k_3·r⁶) / (1 + k_4·r² + k_5·r⁴ + k_6·r⁶)
v_d = v · (1 + k_1·r² + k_2·r⁴ + k_3·r⁶) / (1 + k_4·r² + k_5·r⁴ + k_6·r⁶)
where (u_d, v_d) denotes the pixel position on the distorted original image, (u, v) denotes the corrected pixel position, k_1, k_2, k_3, k_4, k_5, k_6 are the radial distortion coefficients, and r is the distance from the normalized point coordinate to the principal point. Tangential distortion arises because the lens itself is not parallel to the camera sensor plane; it is generally described by the following expression:
Δu = 2·p_1·u·v + p_2·(r² + 2u²)
Δv = p_1·(r² + 2v²) + 2·p_2·u·v
where p_1, p_2 are the tangential distortion coefficients. A checkerboard calibration image is used as the calibration object, and the distortion parameters of the camera are estimated by detecting the corner positions in the image. Each pixel is then undistorted using these parameters to obtain the corrected image.
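For illustration, a minimal Python/OpenCV sketch of this calibration-and-correction step is given below. The checkerboard size and the use of OpenCV's rational distortion model (which yields k_1 through k_6 alongside p_1, p_2) are assumptions made for the example rather than details taken from this disclosure.

    # Sketch: estimate distortion parameters from checkerboard images and undistort a frame.
    # Board size (9x6) and the rational model flag are illustrative assumptions.
    import cv2
    import numpy as np

    def calibrate_and_undistort(board_images, frame, board_size=(9, 6)):
        objp = np.zeros((board_size[0] * board_size[1], 3), np.float32)
        objp[:, :2] = np.mgrid[0:board_size[0], 0:board_size[1]].T.reshape(-1, 2)
        obj_pts, img_pts = [], []
        for img in board_images:
            gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
            found, corners = cv2.findChessboardCorners(gray, board_size)
            if found:
                obj_pts.append(objp)
                img_pts.append(corners)
        # CALIB_RATIONAL_MODEL yields k1..k6 in addition to p1, p2
        _, K, dist, _, _ = cv2.calibrateCamera(
            obj_pts, img_pts, gray.shape[::-1], None, None,
            flags=cv2.CALIB_RATIONAL_MODEL)
        return cv2.undistort(frame, K, dist)  # corrected image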
The purpose of denoising the underwater image is to improve its clarity and visibility so that the subsequent system can analyze and understand it. A median filter is used to remove noise from the visual guidance image: for each pixel in the image, the original value is replaced by the median of the pixel values in its neighborhood. Let the target pixel value before filtering be I_pre(x, y) and after filtering be I_post(x, y); replacing the original pixel with the median of the neighborhood s can be expressed as:
I_post(x, y) = mid{ I_pre(x + Δx, y + Δy), (Δx, Δy) ∈ s }
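A correspondingly small sketch of this median filtering step, assuming a 5×5 neighborhood s, is:

    # Sketch: median filtering of the corrected underwater image.
    import cv2

    def denoise(corrected_image, ksize=5):  # ksize (the neighborhood s) is an assumed value
        # replaces each pixel with the median of its ksize x ksize neighborhood
        return cv2.medianBlur(corrected_image, ksize)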
The preprocessed image is then input into the near-end visual guidance deep learning model and enters the pose estimation part of the near-end visual guidance.
The underwater image preprocessing is used for improving the quality and definition of the underwater image, so that the accuracy and the robustness of the underwater 6D pose estimation are improved. The method can eliminate or reduce the influence of light attenuation, scattering, color distortion and other factors in the underwater environment on the image, so that the characteristic points in the image are more obvious and easier to identify.
Step S13: Inputting the preprocessed image into a near-end pose estimation deep learning model for pose estimation, wherein the near-end pose estimation deep learning model comprises a target object detection module, a docking station internal target object dense point reconstruction module, a normal vector supervision module and a graph convolution-based Edge-PnP pose estimation module; the target object detection module is used for detecting a target object in the preprocessed image and generating a target object image block; the docking station internal target object dense point reconstruction module is used for extracting features from the target object image block by adopting an encoding-decoding network and reconstructing dense point coordinates of the target object; the normal vector supervision module is used for recovering a surface normal vector diagram of the object from the target object image block; and the Edge-PnP pose estimation module learns pose features from the dense point coordinates of the target object and the surface normal vector diagram of the object by using a graph convolutional network and estimates the final 6D pose.
Specifically, each module in the near-end visual pose estimation deep learning model is described in detail below.
The target object detection module is used for detecting a target object in the preprocessed image and generating a target object image block, and comprises the following steps:
and detecting the preprocessed underwater image by using a target detector YOLO, and cutting out a target object image block with a fixed size.
In particular, referring to fig. 2, fig. 2 is a schematic diagram illustrating a deep learning model of near-end docking pose estimation according to an exemplary embodiment. The first step of the near-end docking pose estimation deep learning model is to detect the preprocessed image with the target detector YOLO and cut out a target object image block of fixed size.
After the preprocessed clear underwater image is obtained, a target object image block covering the target area is cropped out by the target object detection module and resized, and the global image information of the image block in the original image is recorded, namely its position (C_x, C_y) in the original image and the image block size (h, w).
For the 3D offset, three variables are regressed in the 6D pose information: T_s = (dx, dy, t_z). Here dx and dy denote the offset from the center of the object detection box to the center of the real object; this is not the absolute offset, as the network is trained to predict a relative offset. t_z is the scale-normalized depth.
Here (O_x, O_y) and (C_x, C_y) are the center of the object in the object image block and the center of the target image block, respectively; (W, H) is the size of the preprocessed clear image, and r = max(W, H) / max(h, w) is the zoom-in ratio.
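A small sketch of how these quantities could be computed from a detection box is given below; the exact normalization of dx, dy and t_z (division by the block size and the zoom-in ratio) is an illustrative assumption, since the original formula images are not reproduced in this text.

    # Sketch: global image info of the cropped block and the relative 3D-offset targets.
    # The normalization of (dx, dy, t_z) is an illustrative assumption, not the verbatim formula.
    def offset_targets(obj_center, obj_depth, box, image_size):
        Ox, Oy = obj_center            # projected center of the real object (pixels)
        Cx, Cy, w, h = box             # detection-box center and size (pixels)
        W, H = image_size              # preprocessed image size
        r = max(W, H) / max(w, h)      # zoom-in ratio
        dx = (Ox - Cx) / w             # relative offset, not absolute
        dy = (Oy - Cy) / h
        t_z = obj_depth / r            # scale-normalized depth
        return dx, dy, t_z, r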
The docking station internal target object dense point reconstruction module is configured to extract features from the target object image block using an encoding-decoding network and reconstruct dense point coordinates of the target object, including:
and extracting features from the target object image block by using a coding network, decoding the extracted features, and supplementing and generating dense point coordinates by using multi-scale intermediate internal space features generated in the decoding and reconstruction process.
In particular, referring to fig. 3, fig. 3 is a schematic diagram illustrating a dense point reconstruction module for an internal object of a docking station according to an exemplary embodiment. Progressive features of the target object image block are extracted with the encoding network; the extracted features are decoded to obtain the multi-scale intermediate internal spatial features generated in the decoding process, these features form feature blocks, and together they supervise the generation of the dense point coordinates of the target object.
The dense point reconstruction module of the object in the docking station takes as input the implicit features extracted by the encoder backbone network: F_base,1/8 ∈ R^(1024×8×8), F_skip1,1/4 ∈ R^(256×16×16) and F_skip2,1/2 ∈ R^(128×32×32), and outputs the reconstructed dense 3D coordinate point map M_rec, the mask map M_mask and the surface block map M_region. When reconstructing the output 3D coordinate point map M_rec at multiple scales, decoded intermediate feature maps F_1/4 and F_1/2 of different resolutions and different scales, from blurred to sharp, are generated; the multi-scale reconstruction fusion module fuses them to generate a clearer and more accurate object dense coordinate point map M_rec.
The real object surface block map M_region-ground-truth is derived from the real object dense coordinate point map using farthest point sampling. The real object dense coordinate point map is a coordinate representation in the standardized object coordinate space. The standardized object coordinate space (NOCS, Normalized Object Coordinate Space) gives different objects a common reference frame. It describes the object in a canonical world coordinate system: the object is defined inside a 3D space containing a unit cube, so the object-space coordinates satisfy x, y, z ∈ [0, 1]. The dense coordinate points of the real object are exactly its NOCS representation.
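A sketch of how the ground-truth surface block map could be derived from the NOCS map with farthest point sampling is shown below; the number of regions (64) is an assumption inferred from the 65-channel surface block output described later (64 regions plus background), not a value stated here.

    # Sketch: ground-truth surface block map from the NOCS map via farthest point sampling.
    # n_regions = 64 is an assumption (65 output channels = 64 regions + background).
    import numpy as np

    def surface_block_gt(nocs, mask, n_regions=64):
        # nocs: (H, W, 3) coordinates in [0, 1]; mask: (H, W) bool visibility of the object
        pts = nocs[mask]                                # (M, 3) visible NOCS points
        centers = [pts[0]]
        dist = np.linalg.norm(pts - centers[0], axis=1)
        for _ in range(1, n_regions):                   # farthest point sampling
            centers.append(pts[np.argmax(dist)])
            dist = np.minimum(dist, np.linalg.norm(pts - centers[-1], axis=1))
        centers = np.stack(centers)                     # (n_regions, 3)
        labels = np.zeros(mask.shape, dtype=np.int64)   # 0 = background
        d = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
        labels[mask] = d.argmin(axis=1) + 1             # assign each pixel to its nearest center
        return labels                                   # (H, W) region index map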
An atrous spatial pyramid pooling module (ASPP, Atrous Spatial Pyramid Pooling) serves as the context extractor; it takes F_base,1/8 and F_skip1,1/4 as input and generates an intermediate latent representation F_i,1/4. A further intermediate representation F_i,1/2 is obtained from F_skip2,1/2.
A multi-branch framework consisting of four branches follows. The outputs of the first three branches are used to guide the final object dense point coordinate map M_rec; the multi-scale intermediate internal spatial features output by these three branches act as skip connections that establish, during decoding, the relationship between the internal features and the reconstructed dense map.
Branch 1 contains seven convolution layers that limit the output dimension to 4, producing a low-resolution feature layer F_o,1/4 ∈ R^(4×16×16). To roughly match the mask and the dense 3D point coordinates of the object, the low-resolution layer F_o,1/4 is upsampled into a multi-scale intermediate internal spatial feature.
The second branch consists of six convolution layers and a nonlinear layer; it takes F_i,1/2, US(F_i,1/4) and the upsampled feature from branch 1 as input to create a medium-resolution feature layer F_o,1/2. Similarly, the multi-scale intermediate internal feature output at this stage is derived from F_o,1/2.
The third branch consists of an upsampling layer and two convolution layers, and it generates the high-resolution feature layer F_o,1 ∈ R^(69×64×64). The first four of the 69 channels of F_o,1 are used to construct another multi-scale intermediate internal spatial feature (spatial supervision), while the remaining 65 channels form the surface block map M_region.
Branch 4 is constructed directly from an upsampling layer and a convolution layer to obtain the global feature cue F_clue ∈ R^(n×64×64).
These multi-scale intermediate internal spatial features (the so-called multi-scale spatial supervision) have exactly the same spatial dimensions as the final reconstructed object dense point coordinates M_rec, and at their different scales they characterize object details from rough contours to detailed surface features. The mask map, the reconstructed dense point coordinates and the surface block map are generated jointly from these feature layers, where W_i,j, i, j ∈ [1, 2, 3, 4], denote the corresponding convolution transformations and f denotes an activation function; the mask is taken from the first channel of the corresponding feature matrix. In the loss design, the supervised quantities are the reconstructed dense point coordinates, the estimated mask map and the estimated surface block map.
According to the scheme, in the decoding process, the feature blocks are formed by utilizing the multi-scale middle internal space features, and the dense point coordinates of the generated target object are supervised together, so that the dense point coordinates of the 3D target object with higher precision can be obtained.
The normal vector supervision module is used for recovering a surface normal vector diagram of the object from the target object image block, and comprises the following steps:
and extracting features from the target object image block by using the coding network, recovering the normal vector of the object surface by using the extracted features, and recovering the mask of the target image block in the recovery process.
In particular, object surface normal vectors are recovered using a normal vector supervision module because it is difficult to recover rich surface information from pictures of non-textured objects. And extracting features from the target object image blocks by using the coding network, and recovering the normal vector diagram of the object surface by using the extracted features in the decoding network, wherein the mask of the target image blocks is recovered in the recovery process.
The generated visible-surface normal vector diagram can represent finer-grained details of the object surface, and the object surface normal vector map M_nor encodes the geometric details of the object's frontal field of view.
The real 2D surface normal vector map M_nor is obtained from the Normalized Object Coordinate Space (NOCS) map. A Sobel operator is used to construct directional-derivative representations D_x and D_y of the NOCS map along the x and y axes. The normal vector at pixel point p = (x, y) is formulated as:
N_(x,y) = D_x × D_y
where × denotes the cross product. This operation produces singular values at the edges of the object, so an erosion algorithm is used to filter the boundary outliers, and an erosion mask M_ero is generated to segment the object.
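This construction can be illustrated with the following OpenCV/NumPy sketch; the Sobel and erosion kernel sizes are assumed values for the example.

    # Sketch: ground-truth surface normal map from the NOCS map via Sobel derivatives,
    # plus an erosion mask to filter boundary outliers. Kernel sizes are assumed values.
    import cv2
    import numpy as np

    def normal_map_from_nocs(nocs, mask, ksize=3):
        # nocs: (H, W, 3) float32 NOCS coordinates; mask: (H, W) uint8 object mask
        Dx = cv2.Sobel(nocs, cv2.CV_32F, 1, 0, ksize=ksize)    # directional derivative along x
        Dy = cv2.Sobel(nocs, cv2.CV_32F, 0, 1, ksize=ksize)    # directional derivative along y
        N = np.cross(Dx, Dy)                                   # N_(x,y) = D_x x D_y
        N /= (np.linalg.norm(N, axis=2, keepdims=True) + 1e-8) # normalize to unit vectors
        ero = cv2.erode(mask, np.ones((3, 3), np.uint8))       # erosion mask M_ero
        N[ero == 0] = 0                                        # drop singular values at edges
        return N, ero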
The Edge-PnP pose estimation module learns pose features from the dense point coordinates of the target object and the surface normal vector diagram of the object by using a graph convolutional network and estimates the final 6D pose, including:
for the reconstructed dense 3D point coordinates of the target object and the normal vector diagram of the object surface, combining the 2D pixel point coordinates and the multi-scale intermediate internal spatial features of claim 5, and inputting them together into the Edge-PnP pose estimation module to extract features;
extracting point-by-point features by using point-edge convolution, and finally mapping to a 6D output by using a fully connected layer, i.e., estimating the 6D pose information;
wherein the 6D pose information is the position and orientation of the target relative to the monocular camera, represented as a rotation matrix and a translation vector; it is converted into the pose result of the target relative to the AUV through the fixed transformation between the camera and the AUV body coordinate system.
Specifically, referring to fig. 4, fig. 4 is a schematic diagram illustrating an Edge-PnP pose estimation module according to an exemplary embodiment.
A point-edge graph convolution block is used to extract point-by-point feature information from the reconstructed dense 3D point coordinates of the target object and the object surface normal vector diagram, combined with the 2D pixel point coordinates; another point-edge graph convolution block extracts point-by-point feature information from the multi-scale intermediate internal spatial features combined with the 2D pixel point coordinates; finally, a fully connected layer maps the features to a 6D output, i.e., the estimated 6D pose information.
The multi-scale intermediate internal spatial features not only serve as spatial guidance for generating the object dense 3D point coordinates, but also, together with the generated dense 3D point coordinates, simplify the pose estimation process. Since these multi-scale features characterize the details and state of the object from rough shape to precise coordinates, they help to roughly identify the pose before it is finely estimated, forming a pre-estimation process.
Referring to fig. 5, fig. 5 is a schematic diagram of a point-edge convolution block according to an exemplary embodiment.
For the input multi-scale geometry supervision feature map F_edge_in ∈ R^(78×64×64), the feature point map is characterized as a point set X = {x_1, …, x_n} with n = 64×64 points, each of dimension F = 78.
A directed graph G = (V, E) representing the local point-cloud structure is computed, where V = {1, …, n} and E ⊆ V×V are the vertex set and the edge set, respectively. G is constructed as the k-nearest-neighbor (k-NN) graph of X.
Wherein the graph includes self-loops, which means that each node points to itself as well.
The edge feature is defined as e_ij = h_Θ(x_i, x_j), where h_Θ : R^F × R^F → R^F′ is a nonlinear function with a set of learnable parameters Θ.
The edge convolution operation is defined by applying a channel-wise symmetric aggregation operation (a summation or a maximum) over the edge features associated with all edges emanating from each vertex, for example
x′_i = max_{j:(i,j)∈E} h_Θ(x_i, x_j)
Here x_i is regarded as the center pixel and {x_j : (i, j) ∈ E} as its neighbors. Given an F-dimensional point cloud with n points, edge convolution produces an F′-dimensional point cloud with the same number of points.
Here x_1, …, x_n represent image pixels on a regular grid, and the graph G has connectivity representing fixed-size image blocks around each pixel.
h is defined as
h_Θ(x_i, x_j) = h_Θ(x_i, x_j − x_i, x_j)
This explicitly combines the global shape structure captured by the center coordinate x_i of the image block with the local neighborhood information captured by x_j − x_i and x_j. The operator is defined as
e′_ijm = ReLU(θ_m · x_i + φ_m · (x_j − x_i) + γ_m · x_j)
implemented as a shared MLP, while
x′_im = max_{j:(i,j)∈E} e′_ijm
Here Θ = (θ_1, …, θ_m, φ_1, …, φ_m, γ_1, …, γ_m).
A pairwise distance matrix is computed in the feature space, and the k nearest points are then taken for each point.
The whole Edge-PnP pose estimation module is defined and updated through the following procedure, and finally a fully connected (FC) network regresses the 6D pose result.
Notation: X = {x_1, …, x_n} denotes the point set, n the number of points, F the dimension of the feature vectors, and k the number of neighbors. For each point x_i, its k nearest neighbors are found, constructing the kNN graph G.
For each edge (x_i, x_j), its feature vector is computed from x_i, x_j and x_j − x_i, where x_i and x_j are the feature vectors of the two endpoints. This edge feature captures the relative position and orientation between the two points as well as the information of the points themselves.
For each point x_i, it is concatenated with the feature vectors of all its neighbors to obtain a matrix M_i = [x_i, x_j, x_j − x_i], i.e., the feature vectors e_ij of all edges connected to x_i.
For each point x_i, the function h is applied to its matrix M_i, and new feature vectors e′_ijm = MLP(M_i) are obtained through a multi-layer perceptron (MLP). The MLP is a fully connected neural network that can learn nonlinear transformations.
For each point x_i, max-pooling is applied over the new feature vectors of all its neighbors to obtain an aggregate feature vector g_i = maxpool(e′_ijm). Max-pooling is a dimensionality-reduction operation that extracts the most salient features.
For each point x_i, its own new feature vector and the aggregate feature vector are concatenated to obtain the final feature vector x′_i = [e′_ijm; g_i]. This preserves both the point's own information and its neighborhood information.
Therefore, processing the features with a point-edge graph convolution block that handles point data efficiently greatly improves the method's perception of the features and thus the prediction accuracy.
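A minimal PyTorch sketch of such a point-edge (EdgeConv-style) graph convolution block is given below; the choice of k, the channel widths and the batch-norm/ReLU arrangement are illustrative assumptions rather than the exact configuration described above.

    # Minimal EdgeConv-style point-edge graph convolution block (sketch).
    # Shapes, k, and channel widths are illustrative assumptions.
    import torch
    import torch.nn as nn

    def knn_graph(x, k):
        # x: (B, C, N) point features; returns indices of the k nearest neighbors, (B, N, k)
        inner = -2 * torch.matmul(x.transpose(1, 2), x)          # (B, N, N)
        xx = torch.sum(x ** 2, dim=1, keepdim=True)              # (B, 1, N)
        dist = -(xx.transpose(1, 2) + inner + xx)                # negative squared distance
        return dist.topk(k=k, dim=-1).indices                    # self-loops are included

    class PointEdgeConv(nn.Module):
        def __init__(self, in_ch, out_ch, k=16):
            super().__init__()
            self.k = k
            # shared MLP over concatenated [x_i, x_j - x_i, x_j] edge features
            self.mlp = nn.Sequential(
                nn.Conv2d(3 * in_ch, out_ch, kernel_size=1, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )

        def forward(self, x):
            # x: (B, C, N)
            B, C, N = x.shape
            idx = knn_graph(x, self.k)                                           # (B, N, k)
            idx = (idx + torch.arange(B, device=x.device).view(-1, 1, 1) * N).view(-1)
            feat = x.transpose(1, 2).contiguous().view(B * N, C)                 # flatten points
            neighbors = feat[idx].view(B, N, self.k, C)                          # x_j
            center = x.transpose(1, 2).unsqueeze(2).expand(-1, -1, self.k, -1)   # x_i
            edge = torch.cat([center, neighbors - center, neighbors], dim=-1)    # [x_i, x_j - x_i, x_j]
            edge = edge.permute(0, 3, 1, 2)                                      # (B, 3C, N, k)
            out = self.mlp(edge)                                                 # e'_ijm
            return out.max(dim=-1).values                                        # max over neighbors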
The fully end-to-end pose estimation network directly regresses the 6D pose information from the preprocessed image, and the end-to-end network is trained with the following loss function:
L = L_pose + L_rec + L_nor
where L_pose is the 6D pose loss; the ground-truth 6D poses used during training are obtained from a self-built dataset.
As described above, three variables T_s = (dx, dy, t_z) are regressed for the 3D offset, where dx and dy are the relative offsets from the detection-box center to the real object center and t_z is the scale-normalized depth.
The 3D translation in the predicted 6D pose is recovered from these relative offsets together with the focal lengths f_x, f_y of the monocular camera; a sketch of this recovery is given below.
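The sketch shows one standard pinhole back-projection consistent with the variables defined above; the exact parameterization (the use of the zoom-in ratio r and the principal point (c_x, c_y)) is an assumption, since the original recovery formula is not reproduced in this text.

    # Sketch: recovering the 3D translation from the regressed relative offsets.
    # The exact parameterization is an assumption in the spirit of the text, not the verbatim formula.
    def recover_translation(dx, dy, t_z, box, image_size, fx, fy, cx, cy):
        Cx, Cy, w, h = box              # detection-box center and size in the full image
        W, H = image_size
        r = max(W, H) / max(w, h)       # zoom-in ratio as defined in the text
        Ox = dx * w + Cx                # recovered projected object center (pixels)
        Oy = dy * h + Cy
        Tz = t_z * r                    # undo the scale normalization of the depth
        Tx = (Ox - cx) * Tz / fx        # pinhole back-projection with principal point (cx, cy)
        Ty = (Oy - cy) * Tz / fy
        return Tx, Ty, Tz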
The reconstruction loss includes: an L_1 loss supervising the reconstructed object dense point coordinates M_rec and the segmentation mask M_mask, and a cross-entropy loss CE supervising the surface block map M_region.
Only the pixel coordinate points belonging to the visible object inside the object image block are considered. The ground-truth pixels of the real object dense map are compared with the pixels I_i of the reconstructed object dense map, where M denotes the set of all object pixels; likewise, the pixels of the real mask map and the pixels I_seg,i of the predicted mask map are compared.
For the surface normal vector map, the cosine distance is used to measure the difference between the predicted surface normal vector map and the corresponding real object surface normal vector map, and an L_1 loss is used for the erosion mask map.
Here N_(x,y) and its ground-truth counterpart denote the predicted target object surface normal vector map and the real surface normal vector map at coordinates (x, y), respectively.
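A compact sketch of how such a combined reconstruction-and-normal loss might be assembled in PyTorch is shown below; the tensor keys, shapes and equal weighting are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def docking_losses(pred, gt):
        # pred/gt: dicts of tensors; keys, shapes and weights are illustrative assumptions.
        vis = gt["mask"].unsqueeze(1)                            # (B,1,H,W) visible object pixels
        # L1 loss on reconstructed dense point coordinates, visible pixels only
        l_points = (torch.abs(pred["dense_xyz"] - gt["dense_xyz"]) * vis).sum() / vis.sum().clamp(min=1)
        # L1 loss on the predicted segmentation mask
        l_mask = F.l1_loss(pred["mask"], gt["mask"])
        # cross-entropy loss on the surface block map (per-pixel region classification)
        l_region = F.cross_entropy(pred["region_logits"], gt["region_labels"])
        # cosine distance between predicted and real surface normal maps, visible pixels only
        cos = F.cosine_similarity(pred["normals"], gt["normals"], dim=1, eps=1e-6)  # (B,H,W)
        l_normal = ((1.0 - cos) * gt["mask"]).sum() / gt["mask"].sum().clamp(min=1)
        # L1 loss on the erosion mask
        l_ero = F.l1_loss(pred["erosion_mask"], gt["erosion_mask"])
        l_rec = l_points + l_mask + l_region
        l_nor = l_normal + l_ero
        return l_rec + l_nor                                     # L_pose is added by the caller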
The final regression output is the 6D pose information [R, T_x, T_y, T_z]. The 6D pose information is the position and orientation of the object relative to the camera, represented as a rotation matrix and a translation vector; it is converted into the pose result of the target relative to the AUV through the fixed transformation between the camera and the AUV body coordinate system.
Corresponding to the embodiment of the AUV near-end docking pose estimation method based on dense point reconstruction, the application also provides an embodiment of an AUV near-end docking pose estimation device based on dense point reconstruction.
Fig. 6 is a block diagram illustrating an AUV near-end docking pose estimation apparatus based on dense point reconstruction according to an exemplary embodiment. Referring to FIG. 6, the apparatus includes:
An acquisition unit 11 for acquiring an underwater image containing a docking target object through a built-in monocular camera of the AUV;
a preprocessing unit 12 for performing image preprocessing on the underwater image;
the pose estimation unit 13 is configured to input the preprocessed image into the near-end docking pose estimation deep learning model for pose estimation, where the near-end docking pose estimation deep learning model includes a target object detection module, a docking station internal target object dense point reconstruction module, a normal vector supervision module, and a graph convolution-based Edge-PnP pose estimation module; the target object detection module is configured to detect a target object in the preprocessed image and generate a target object image block; the docking station internal target object dense point reconstruction module is configured to extract features from the target object image block using an encoding-decoding network and reconstruct dense point coordinates of the target object; the normal vector supervision module is configured to recover a surface normal vector diagram of the object from the target object image block; and the Edge-PnP pose estimation module learns pose features from the dense point coordinates of the target object and the surface normal vector diagram of the object by using a graph convolutional network and estimates the final 6D pose.
The specific manner in which the various modules perform their operations in the apparatus of the above embodiments has been described in detail in connection with the method embodiments and will not be repeated here.
For the device embodiments, reference is made to the description of the method embodiments for the relevant points, since they essentially correspond to the method embodiments. The apparatus embodiments described above are merely illustrative: the components illustrated as separate may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present application. Those of ordinary skill in the art will understand and implement the invention without undue burden.
Correspondingly, the application also provides electronic equipment, which comprises: one or more processors; a memory for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for estimating an AUV near-end docking pose based on dense point reconstruction as described above.
Correspondingly, the application also provides a computer readable storage medium, on which computer instructions are stored, which when executed by a processor, implement the AUV near-end docking pose estimation method based on dense point reconstruction.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (9)

1. The AUV near-end docking monocular pose estimation method based on dense point reconstruction is characterized by comprising the following steps of:
acquiring an underwater image containing a docking station target object through a built-in monocular camera of the AUV;
performing image preprocessing on the underwater image;
inputting the preprocessed image into a near-end pose estimation deep learning model for pose estimation, wherein the near-end pose estimation deep learning model comprises a target object detection module, a docking station internal target object dense point reconstruction module, a normal vector supervision module and a graph convolution-based Edge-PnP pose estimation module; the target object detection module is used for detecting a target object in the preprocessed image and generating a target object image block; the docking station internal target object dense point reconstruction module is used for extracting features from the target object image block by adopting an encoding-decoding network and reconstructing dense point coordinates of the target object; the normal vector supervision module is used for recovering a surface normal vector diagram of the object from the target object image block; and the Edge-PnP pose estimation module learns pose features from the dense point coordinates of the target object and the surface normal vector diagram of the object by using a graph convolutional network and estimates the final 6D pose.
2. The method of claim 1, wherein image preprocessing the underwater image comprises:
and carrying out distortion correction on the underwater image to enable the image to show a correct shape, and then carrying out denoising treatment to obtain a clear underwater image.
3. The method of claim 1, wherein the object detection module is configured to detect an object in the preprocessed image and generate an object image block, and comprises:
and detecting the preprocessed underwater image by using a target detector YOLO, and cutting out a target object image block with a fixed size.
4. The method of claim 1, wherein the docking station internal target object dense point reconstruction module extracts features from the target object image block using an encoding-decoding network and reconstructs dense point coordinates of the target object, comprising:
and extracting features from the target object image block by using a coding network, decoding the extracted features, and supplementing and generating dense point coordinates by using multi-scale intermediate internal space features generated in the decoding and reconstruction process.
5. The method of claim 1, wherein the normal vector supervision module is configured to recover a surface normal vector map of an object from the target object image block, and comprises:
and extracting features from the target object image block by using the coding network, recovering the normal vector of the object surface by using the extracted features, and recovering the mask of the target image block in the recovery process.
6. The method of claim 5, wherein the Edge-PnP pose estimation module utilizing a graph convolutional network to learn pose features from dense point coordinates of the target object and a surface normal vector graph of the object and to estimate a final 6D pose comprises:
for the reconstructed dense 3D point coordinates of the target object and the normal vector diagram of the object surface, combining the 2D pixel point coordinates and the multi-scale intermediate internal spatial features of claim 5, and inputting them together into the Edge-PnP pose estimation module to extract features;
extracting point-by-point features by using point-edge convolution, and finally mapping to a 6D output by using a fully connected layer, i.e., estimating the 6D pose information;
wherein the 6D pose information is the position and orientation of the target relative to the monocular camera, represented as a rotation matrix and a translation vector; it is converted into the pose result of the target relative to the AUV through the fixed transformation between the camera and the AUV body coordinate system.
7. An AUV near-end docking monocular pose estimation device based on dense point reconstruction is characterized by comprising:
the acquisition unit is used for acquiring an underwater image containing a docking station target object through a built-in monocular camera of the AUV;
the preprocessing unit is used for preprocessing the underwater image;
the pose estimation unit is used for inputting the preprocessed image into a near-end docking pose estimation deep learning model to perform pose estimation, wherein the near-end docking pose estimation deep learning model comprises a target object detection module, a docking station internal target object dense point reconstruction module, a normal vector supervision module and a graph convolution-based Edge-PnP pose estimation module; the target object detection module is used for detecting a target object in the preprocessed image and generating a target object image block; the docking station internal target object dense point reconstruction module is used for extracting features from the target object image block by adopting an encoding-decoding network and reconstructing dense point coordinates of the target object; the normal vector supervision module is used for recovering a surface normal vector diagram of the object from the target object image block; and the Edge-PnP pose estimation module learns pose features from the dense point coordinates of the target object and the surface normal vector diagram of the object by using a graph convolutional network and estimates the final 6D pose.
8. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-6.
9. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method according to any of claims 1-6.
CN202310364180.0A 2023-04-07 2023-04-07 AUV near-end docking monocular pose estimation method and device based on dense point reconstruction Pending CN116524340A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310364180.0A CN116524340A (en) 2023-04-07 2023-04-07 AUV near-end docking monocular pose estimation method and device based on dense point reconstruction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310364180.0A CN116524340A (en) 2023-04-07 2023-04-07 AUV near-end docking monocular pose estimation method and device based on dense point reconstruction

Publications (1)

Publication Number Publication Date
CN116524340A true CN116524340A (en) 2023-08-01

Family

ID=87391230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310364180.0A Pending CN116524340A (en) 2023-04-07 2023-04-07 AUV near-end docking monocular pose estimation method and device based on dense point reconstruction

Country Status (1)

Country Link
CN (1) CN116524340A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117635679A (en) * 2023-12-05 2024-03-01 之江实验室 Curved surface efficient reconstruction method and device based on pre-training diffusion probability model
CN117635679B (en) * 2023-12-05 2024-05-28 之江实验室 Curved surface efficient reconstruction method and device based on pre-training diffusion probability model


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination