CN116524340A - AUV near-end docking monocular pose estimation method and device based on dense point reconstruction - Google Patents

AUV near-end docking monocular pose estimation method and device based on dense point reconstruction

Info

Publication number
CN116524340A
CN116524340A (application number CN202310364180.0A)
Authority
CN
China
Prior art keywords
pose
target object
image
module
normal vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310364180.0A
Other languages
Chinese (zh)
Inventor
徐元欣
刘诚
陈首旭
单文才
马天珩
王鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202310364180.0A priority Critical patent/CN116524340A/en
Publication of CN116524340A publication Critical patent/CN116524340A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/05Underwater scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/24Aligning, centring, orientation detection or correction of the image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/778Active pattern-learning, e.g. online learning of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an AUV near-end docking monocular pose estimation method and device based on dense point reconstruction, comprising the following steps: acquiring an underwater image through a built-in monocular camera of the AUV; performing image preprocessing on the underwater image; and inputting the preprocessed image into a near-end pose estimation deep learning model for pose estimation. The deep learning model comprises a target object detection module for detecting the target object in the preprocessed image and generating a target object image block; a docking station internal target object dense point reconstruction module that adopts an encoding-decoding network to extract features from the target object image block and reconstruct dense point coordinates of the target object; a normal vector supervision module for recovering a surface normal vector diagram of the object from the target object image block; and an Edge-PnP pose estimation module that learns pose features from the dense point coordinates of the target object and the surface normal vector diagram of the object by using a graph convolutional network and estimates the final 6D pose. The method solves the problem of traditional methods in which key points are lost and the pose cannot be solved.

Description

AUV near-end docking monocular pose estimation method and device based on dense point reconstruction
Technical Field
The application relates to the technical field of AUV recycling and docking, in particular to an AUV near-end docking monocular pose estimation method and device based on dense point reconstruction.
Background
An AUV (Autonomous Underwater Vehicle) is an autonomous underwater vehicle; over the last decades, with the continued development of technology, AUVs have become a key tool for underwater surveying, exploration and monitoring. Underwater supporting facilities such as fixed underwater docking stations and large underwater mobile docking stations provide more comprehensive support for the AUV. Docking with a fixed underwater docking station can provide basic services for the AUV, such as charging, data transfer and task instructions. The AUV may dock periodically with a fixed underwater docking station to ensure adequate and safe energy and data when performing long-duration or data-intensive acquisition tasks. By docking with a large underwater mobile docking station, the AUV can be deployed and recovered more quickly and safely, improving the efficiency and success rate of tasks.
The current underwater AUV vision guidance method generally adopts a passive optical guidance scheme based on a lamp array: a light-emitting array formed by underwater guide lamps serves as the underwater target, the pixel positions of the guide lamps in the image are extracted by a vision method, and the relative pose between the AUV and the lamp array is calculated by combining the prior 3D coordinate information of the guide lamps. However, the existing lamp-array guidance scheme has the following defects: 1. interference from underwater floating matter and water-body scattering easily cause errors when the vision method extracts the guide-light centers; 2. the guide lamps have few distinctive features of their own, can only be used in dark deep-sea environments, and their center key points are easily misidentified when other light sources interfere; 3. once some of the guide lamps in the array leave the camera's field of view, the key points are reduced or even lost, so the relative pose cannot be solved.
Disclosure of Invention
In view of this, the present application provides an AUV near-end docking monocular pose estimation method and apparatus based on dense point reconstruction.
According to a first aspect of an embodiment of the present invention, there is provided an AUV near-end docking monocular pose estimation method based on dense point reconstruction, including:
s11: acquiring an underwater image containing a docking station target object through a built-in monocular camera of the AUV;
s12: performing image preprocessing on the underwater image;
s13: inputting the preprocessed image into a near-end pose estimation deep learning model for pose estimation, wherein the near-end pose estimation deep learning model comprises a target object detection module, a docking station internal target object dense point reconstruction module, a normal vector supervision module and a graph convolution-based Edge-PnP pose estimation module; the target object detection module is used for detecting a target object in the preprocessed image and generating a target object image block; the docking station internal target object dense point reconstruction module is used for extracting features from the target object image block by adopting an encoding-decoding network and reconstructing dense point coordinates of the target object; the normal vector supervision module is used for recovering a surface normal vector diagram of the object from the target object image block; and the Edge-PnP pose estimation module learns pose features from the dense point coordinates of the target object and the surface normal vector diagram of the object by using a graph convolutional network and estimates the final 6D pose.
Optionally, performing image preprocessing on the underwater image includes:
and carrying out distortion correction on the underwater image to enable the image to show a correct shape, and then carrying out denoising treatment to obtain a clear underwater image.
Optionally, the target object detection module is configured to detect a target object in the preprocessed image, and generate a target object image block, including:
and detecting the preprocessed underwater image by using a target detector YOLO, and cutting out a target object image block with a fixed size.
Optionally, the docking station internal target object dense point reconstruction module is configured to extract features from the target object image block using an encoding-decoding network and reconstruct dense point coordinates of the target object, including:
and extracting features from the target object image block by using a coding network, decoding the extracted features, and supplementing and generating dense point coordinates by using multi-scale intermediate internal space features generated in the decoding and reconstruction process.
Optionally, the normal vector supervision module is configured to recover a surface normal vector diagram of the object from the target object image block, and includes:
and extracting features from the target object image block by using the coding network, recovering the normal vector of the object surface by using the extracted features, and recovering the mask of the target image block in the recovery process.
Optionally, the Edge-PnP pose estimation module learns pose features from dense point coordinates of the target object and a surface normal vector diagram of the object by using a graph convolutional network and estimates a final 6D pose, including:
for the reconstructed dense 3D point coordinates of the target object and the normal vector diagram of the object surface, combining the 2D pixel point coordinates and the multi-scale intermediate internal spatial features of claim 5, and inputting them together into the Edge-PnP pose estimation module to extract features;
extracting point-by-point features by using point-edge convolution, and finally mapping to a 6D output by using a fully connected layer, i.e., estimating the 6D pose information;
wherein the 6D pose information is the position and orientation of the target relative to the monocular camera, represented as a rotation matrix and a translation vector; it is converted into the pose result of the target relative to the AUV through the fixed transformation between the camera and the AUV body coordinate system.
According to a first aspect of an embodiment of the present invention, there is provided an AUV near-end docking monocular pose estimation apparatus based on dense point reconstruction, including:
the acquisition unit is used for acquiring an underwater image containing a docking station target object through a built-in monocular camera of the AUV;
the preprocessing unit is used for preprocessing the underwater image;
the pose estimation unit is used for inputting the preprocessed image into a near-end docking pose estimation deep learning model to perform pose estimation, wherein the near-end docking pose estimation deep learning model comprises a target object detection module, a docking station internal target object dense point reconstruction module, a normal vector supervision module and a graph convolution-based Edge-PnP pose estimation module; the target object detection module is used for detecting a target object in the preprocessed image and generating a target object image block; the docking station internal target object dense point reconstruction module is used for extracting features from the target object image block by adopting an encoding-decoding network and reconstructing dense point coordinates of the target object; the normal vector supervision module is used for recovering a surface normal vector diagram of the object from the target object image block; and the Edge-PnP pose estimation module learns pose features from the dense point coordinates of the target object and the surface normal vector diagram of the object by using a graph convolutional network and estimates the final 6D pose.
According to a third aspect of an embodiment of the present invention, there is provided an electronic apparatus including:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of the first aspect.
According to a fourth aspect of embodiments of the present invention there is provided a computer readable storage medium having stored thereon computer instructions which when executed by a processor perform the steps of the method according to the first aspect.
The technical scheme provided by the embodiment of the application can comprise the following beneficial effects:
the preprocessed image is input into a near-end deep learning model for estimating the pose, so that the pose estimation is realized, the end-to-end estimation is realized, the estimation efficiency is greatly improved, and the timeliness is greatly enhanced; by constructing a dense point reconstruction module of an object in the docking station, dense texture information of an input image is fully utilized, and the problems that key points are lost and cannot be resolved in the traditional method are solved; the Edge-PnP pose estimation module based on graph convolution fully utilizes the point-by-point characteristics of dense points, the surface normal vector graph and the middle internal space characteristic to return to the 6D pose, so that the information is sufficient and the precision is high.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
Fig. 1 is a flow chart illustrating a method of estimating the pose of an AUV near-end docking based on dense point reconstruction, according to an exemplary embodiment.
FIG. 2 is a schematic diagram of a deep learning model of a near-end docking pose estimation, according to an exemplary embodiment.
FIG. 3 is a schematic diagram illustrating a dense point reconstruction module for an internal target of a docking station, according to an example embodiment.
FIG. 4 is a schematic diagram illustrating an Edge-PnP pose estimation module according to an exemplary embodiment.
FIG. 5 is a schematic diagram of a point-edge convolution block according to an example embodiment.
Fig. 6 is a block diagram illustrating an AUV near-end docking pose estimation apparatus based on dense point reconstruction, according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first message may also be referred to as a second message, and similarly, a second message may also be referred to as a first message, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "when", "upon" or "in response to determining", depending on the context.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for estimating an AUV near-end docking pose based on dense point reconstruction according to an exemplary embodiment, specifically, the method may include the following steps:
step S11: and acquiring an underwater image containing the docking station target object through a built-in monocular camera of the AUV.
The sample image may be a real image or a synthetic image. In some application scenarios, the sample set may include both real images and synthetic images. When the sample image is a real image, the sample image of the underwater docking station target object whose pose is to be estimated is obtained by having the AUV photograph that target object.
Step S12: and carrying out image preprocessing on the underwater image.
The method mainly comprises two parts of image distortion correction and image denoising.
Image distortion correction aims to eliminate the geometric distortion introduced into the digital image by the camera lens or other factors and restore the image to a normal state. Image distortion can be broadly divided into radial distortion and tangential distortion. Radial distortion is distributed along the lens radius because light rays bend more strongly far from the lens center than near it; it can be described by the first few terms of a Taylor series expansion around the principal point:
u_d = u · (1 + k_1·r² + k_2·r⁴ + k_3·r⁶) / (1 + k_4·r² + k_5·r⁴ + k_6·r⁶)
v_d = v · (1 + k_1·r² + k_2·r⁴ + k_3·r⁶) / (1 + k_4·r² + k_5·r⁴ + k_6·r⁶)
where (u_d, v_d) denotes the pixel position on the distorted original image, (u, v) denotes the corrected pixel position, k_1, k_2, k_3, k_4, k_5, k_6 are the radial distortion coefficients, and r is the distance from the normalized point coordinate to the principal point. Tangential distortion arises because the lens itself is not parallel to the camera sensor plane; it is generally described by the following expression:
Δu = 2·p_1·u·v + p_2·(r² + 2u²)
Δv = p_1·(r² + 2v²) + 2·p_2·u·v
where p_1, p_2 are the tangential distortion coefficients. A checkerboard calibration image is used as the calibration object, and the distortion parameters of the camera are estimated by detecting the corner positions in the image. Each pixel is then undistorted using these parameters to obtain the corrected image.
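For illustration, a minimal Python/OpenCV sketch of this calibration-and-correction step is given below. The checkerboard size and the use of OpenCV's rational distortion model (which yields k_1 through k_6 alongside p_1, p_2) are assumptions made for the example rather than details taken from this disclosure.

    # Sketch: estimate distortion parameters from checkerboard images and undistort a frame.
    # Board size (9x6) and the rational model flag are illustrative assumptions.
    import cv2
    import numpy as np

    def calibrate_and_undistort(board_images, frame, board_size=(9, 6)):
        objp = np.zeros((board_size[0] * board_size[1], 3), np.float32)
        objp[:, :2] = np.mgrid[0:board_size[0], 0:board_size[1]].T.reshape(-1, 2)
        obj_pts, img_pts = [], []
        for img in board_images:
            gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
            found, corners = cv2.findChessboardCorners(gray, board_size)
            if found:
                obj_pts.append(objp)
                img_pts.append(corners)
        # CALIB_RATIONAL_MODEL yields k1..k6 in addition to p1, p2
        _, K, dist, _, _ = cv2.calibrateCamera(
            obj_pts, img_pts, gray.shape[::-1], None, None,
            flags=cv2.CALIB_RATIONAL_MODEL)
        return cv2.undistort(frame, K, dist)  # corrected image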
The purpose of denoising the underwater image is to improve its clarity and visibility so that the subsequent system can analyze and understand it. A median filter is used to remove noise from the visual guidance image: for each pixel in the image, the original value is replaced by the median of the pixel values in its neighborhood. Let the target pixel value before filtering be I_pre(x, y) and after filtering be I_post(x, y); replacing the original pixel with the median of the neighborhood s can be expressed as:
I_post(x, y) = mid{ I_pre(x + Δx, y + Δy), (Δx, Δy) ∈ s }
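A correspondingly small sketch of this median filtering step, assuming a 5×5 neighborhood s, is:

    # Sketch: median filtering of the corrected underwater image.
    import cv2

    def denoise(corrected_image, ksize=5):  # ksize (the neighborhood s) is an assumed value
        # replaces each pixel with the median of its ksize x ksize neighborhood
        return cv2.medianBlur(corrected_image, ksize)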
The preprocessed image is then input into the near-end visual guidance deep learning model and enters the pose estimation part of the near-end visual guidance.
The underwater image preprocessing is used for improving the quality and definition of the underwater image, so that the accuracy and the robustness of the underwater 6D pose estimation are improved. The method can eliminate or reduce the influence of light attenuation, scattering, color distortion and other factors in the underwater environment on the image, so that the characteristic points in the image are more obvious and easier to identify.
Step S13: Inputting the preprocessed image into a near-end pose estimation deep learning model for pose estimation, wherein the near-end pose estimation deep learning model comprises a target object detection module, a docking station internal target object dense point reconstruction module, a normal vector supervision module and a graph convolution-based Edge-PnP pose estimation module; the target object detection module is used for detecting a target object in the preprocessed image and generating a target object image block; the docking station internal target object dense point reconstruction module is used for extracting features from the target object image block by adopting an encoding-decoding network and reconstructing dense point coordinates of the target object; the normal vector supervision module is used for recovering a surface normal vector diagram of the object from the target object image block; and the Edge-PnP pose estimation module learns pose features from the dense point coordinates of the target object and the surface normal vector diagram of the object by using a graph convolutional network and estimates the final 6D pose.
Specifically, each module in the near-end visual pose estimation deep learning model is described in detail below.
The target object detection module is used for detecting a target object in the preprocessed image and generating a target object image block, and comprises the following steps:
and detecting the preprocessed underwater image by using a target detector YOLO, and cutting out a target object image block with a fixed size.
In particular, referring to fig. 2, fig. 2 is a schematic diagram illustrating a deep learning model of near-end docking pose estimation according to an exemplary embodiment. The first step of the near-end docking pose estimation deep learning model is to detect the preprocessed image with the target detector YOLO and cut out a target object image block of fixed size.
After the preprocessed clear underwater image is obtained, a target object image block covering the target area is cropped out by the target object detection module and resized, and the global image information of the image block in the original image is recorded, namely its position (C_x, C_y) in the original image and the image block size (h, w).
For the 3D offset, three variables are regressed in the 6D pose information: T_s = (dx, dy, t_z). Here dx and dy denote the offset from the center of the object detection box to the center of the real object; this is not the absolute offset, as the network is trained to predict a relative offset. t_z is the scale-normalized depth.
Here (O_x, O_y) and (C_x, C_y) are the center of the object in the object image block and the center of the target image block, respectively; (W, H) is the size of the preprocessed clear image, and r = max(W, H) / max(h, w) is the zoom-in ratio.
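A small sketch of how these quantities could be computed from a detection box is given below; the exact normalization of dx, dy and t_z (division by the block size and the zoom-in ratio) is an illustrative assumption, since the original formula images are not reproduced in this text.

    # Sketch: global image info of the cropped block and the relative 3D-offset targets.
    # The normalization of (dx, dy, t_z) is an illustrative assumption, not the verbatim formula.
    def offset_targets(obj_center, obj_depth, box, image_size):
        Ox, Oy = obj_center            # projected center of the real object (pixels)
        Cx, Cy, w, h = box             # detection-box center and size (pixels)
        W, H = image_size              # preprocessed image size
        r = max(W, H) / max(w, h)      # zoom-in ratio
        dx = (Ox - Cx) / w             # relative offset, not absolute
        dy = (Oy - Cy) / h
        t_z = obj_depth / r            # scale-normalized depth
        return dx, dy, t_z, r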
The docking station internal target object dense point reconstruction module is configured to extract features from the target object image block using an encoding-decoding network and reconstruct dense point coordinates of the target object, including:
and extracting features from the target object image block by using a coding network, decoding the extracted features, and supplementing and generating dense point coordinates by using multi-scale intermediate internal space features generated in the decoding and reconstruction process.
In particular, referring to fig. 3, fig. 3 is a schematic diagram illustrating a dense point reconstruction module for an internal object of a docking station according to an exemplary embodiment. Progressive features of the target object image block are extracted with the encoding network; the extracted features are decoded to obtain the multi-scale intermediate internal spatial features generated in the decoding process, these features form feature blocks, and together they supervise the generation of the dense point coordinates of the target object.
The dense point reconstruction module of the object in the docking station takes as input the implicit features extracted by the encoder backbone network: F_base,1/8 ∈ R^(1024×8×8), F_skip1,1/4 ∈ R^(256×16×16) and F_skip2,1/2 ∈ R^(128×32×32), and outputs the reconstructed dense 3D coordinate point map M_rec, the mask map M_mask and the surface block map M_region. When reconstructing the output 3D coordinate point map M_rec at multiple scales, decoded intermediate feature maps F_1/4 and F_1/2 of different resolutions and different scales, from blurred to sharp, are generated; the multi-scale reconstruction fusion module fuses them to generate a clearer and more accurate object dense coordinate point map M_rec.
The real object surface block map M_region-ground-truth is derived from the real object dense coordinate point map using farthest point sampling. The real object dense coordinate point map is a coordinate representation in the standardized object coordinate space. The standardized object coordinate space (NOCS, Normalized Object Coordinate Space) gives different objects a common reference frame. It describes the object in a canonical world coordinate system: the object is defined inside a 3D space containing a unit cube, so the object-space coordinates satisfy x, y, z ∈ [0, 1]. The dense coordinate points of the real object are exactly its NOCS representation.
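A sketch of how the ground-truth surface block map could be derived from the NOCS map with farthest point sampling is shown below; the number of regions (64) is an assumption inferred from the 65-channel surface block output described later (64 regions plus background), not a value stated here.

    # Sketch: ground-truth surface block map from the NOCS map via farthest point sampling.
    # n_regions = 64 is an assumption (65 output channels = 64 regions + background).
    import numpy as np

    def surface_block_gt(nocs, mask, n_regions=64):
        # nocs: (H, W, 3) coordinates in [0, 1]; mask: (H, W) bool visibility of the object
        pts = nocs[mask]                                # (M, 3) visible NOCS points
        centers = [pts[0]]
        dist = np.linalg.norm(pts - centers[0], axis=1)
        for _ in range(1, n_regions):                   # farthest point sampling
            centers.append(pts[np.argmax(dist)])
            dist = np.minimum(dist, np.linalg.norm(pts - centers[-1], axis=1))
        centers = np.stack(centers)                     # (n_regions, 3)
        labels = np.zeros(mask.shape, dtype=np.int64)   # 0 = background
        d = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
        labels[mask] = d.argmin(axis=1) + 1             # assign each pixel to its nearest center
        return labels                                   # (H, W) region index map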
An atrous spatial pyramid pooling module (ASPP, Atrous Spatial Pyramid Pooling) serves as the context extractor; it takes F_base,1/8 and F_skip1,1/4 as input and generates an intermediate latent representation F_i,1/4. A further intermediate representation F_i,1/2 is obtained from F_skip2,1/2.
A multi-branch framework consisting of four branches follows. The outputs of the first three branches are used to guide the final object dense point coordinate map M_rec; the multi-scale intermediate internal spatial features output by these three branches act as skip connections that establish, during decoding, the relationship between the internal features and the reconstructed dense map.
Branch 1 contains seven convolution layers that limit the output dimension to 4, producing a low-resolution feature layer F_o,1/4 ∈ R^(4×16×16). To roughly match the mask and the dense 3D point coordinates of the object, the low-resolution layer F_o,1/4 is upsampled into a multi-scale intermediate internal spatial feature.
The second branch consists of six convolution layers and a nonlinear layer; it takes F_i,1/2, US(F_i,1/4) and the upsampled feature from branch 1 as input to create a medium-resolution feature layer F_o,1/2. Similarly, the multi-scale intermediate internal feature output at this stage is derived from F_o,1/2.
The third branch consists of an upsampling layer and two convolution layers, and it generates the high-resolution feature layer F_o,1 ∈ R^(69×64×64). The first four of the 69 channels of F_o,1 are used to construct another multi-scale intermediate internal spatial feature (spatial supervision), while the remaining 65 channels form the surface block map M_region.
Branch 4 is constructed directly from an upsampling layer and a convolution layer to obtain the global feature cue F_clue ∈ R^(n×64×64).
These multi-scale intermediate internal spatial features (the so-called multi-scale spatial supervision) have exactly the same spatial dimensions as the final reconstructed object dense point coordinates M_rec, and at their different scales they characterize object details from rough contours to detailed surface features. The mask map, the reconstructed dense point coordinates and the surface block map are generated jointly from these feature layers, where W_i,j, i, j ∈ [1, 2, 3, 4], denote the corresponding convolution transformations and f denotes an activation function; the mask is taken from the first channel of the corresponding feature matrix. In the loss design, the supervised quantities are the reconstructed dense point coordinates, the estimated mask map and the estimated surface block map.
According to the scheme, in the decoding process, the feature blocks are formed by utilizing the multi-scale middle internal space features, and the dense point coordinates of the generated target object are supervised together, so that the dense point coordinates of the 3D target object with higher precision can be obtained.
The normal vector supervision module is used for recovering a surface normal vector diagram of the object from the target object image block, and comprises the following steps:
and extracting features from the target object image block by using the coding network, recovering the normal vector of the object surface by using the extracted features, and recovering the mask of the target image block in the recovery process.
In particular, object surface normal vectors are recovered using a normal vector supervision module because it is difficult to recover rich surface information from pictures of non-textured objects. And extracting features from the target object image blocks by using the coding network, and recovering the normal vector diagram of the object surface by using the extracted features in the decoding network, wherein the mask of the target image blocks is recovered in the recovery process.
The generated visible-surface normal vector diagram can represent finer-grained details of the object surface, and the object surface normal vector map M_nor encodes the geometric details of the object's frontal field of view.
The real 2D surface normal vector map M_nor is obtained from the Normalized Object Coordinate Space (NOCS) map. A Sobel operator is used to construct directional-derivative representations D_x and D_y of the NOCS map along the x and y axes. The normal vector at pixel point p = (x, y) is formulated as:
N_(x,y) = D_x × D_y
where × denotes the cross product. This operation produces singular values at the edges of the object, so an erosion algorithm is used to filter the boundary outliers, and an erosion mask M_ero is generated to segment the object.
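This construction can be illustrated with the following OpenCV/NumPy sketch; the Sobel and erosion kernel sizes are assumed values for the example.

    # Sketch: ground-truth surface normal map from the NOCS map via Sobel derivatives,
    # plus an erosion mask to filter boundary outliers. Kernel sizes are assumed values.
    import cv2
    import numpy as np

    def normal_map_from_nocs(nocs, mask, ksize=3):
        # nocs: (H, W, 3) float32 NOCS coordinates; mask: (H, W) uint8 object mask
        Dx = cv2.Sobel(nocs, cv2.CV_32F, 1, 0, ksize=ksize)    # directional derivative along x
        Dy = cv2.Sobel(nocs, cv2.CV_32F, 0, 1, ksize=ksize)    # directional derivative along y
        N = np.cross(Dx, Dy)                                   # N_(x,y) = D_x x D_y
        N /= (np.linalg.norm(N, axis=2, keepdims=True) + 1e-8) # normalize to unit vectors
        ero = cv2.erode(mask, np.ones((3, 3), np.uint8))       # erosion mask M_ero
        N[ero == 0] = 0                                        # drop singular values at edges
        return N, ero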
The Edge-PnP pose estimation module learns pose features from the dense point coordinates of the target object and the surface normal vector diagram of the object by using a graph convolutional network and estimates the final 6D pose, including:
for the reconstructed dense 3D point coordinates of the target object and the normal vector diagram of the object surface, combining the 2D pixel point coordinates and the multi-scale intermediate internal spatial features of claim 5, and inputting them together into the Edge-PnP pose estimation module to extract features;
extracting point-by-point features by using point-edge convolution, and finally mapping to a 6D output by using a fully connected layer, i.e., estimating the 6D pose information;
wherein the 6D pose information is the position and orientation of the target relative to the monocular camera, represented as a rotation matrix and a translation vector; it is converted into the pose result of the target relative to the AUV through the fixed transformation between the camera and the AUV body coordinate system.
Specifically, referring to fig. 4, fig. 4 is a schematic diagram illustrating an Edge-PnP pose estimation module according to an exemplary embodiment.
A point-edge graph convolution block is used to extract point-by-point feature information from the reconstructed dense 3D point coordinates of the target object and the object surface normal vector diagram, combined with the 2D pixel point coordinates; another point-edge graph convolution block extracts point-by-point feature information from the multi-scale intermediate internal spatial features combined with the 2D pixel point coordinates; finally, a fully connected layer maps the features to a 6D output, i.e., the estimated 6D pose information.
The multi-scale intermediate internal spatial features not only serve as spatial guidance for generating the object dense 3D point coordinates, but also, together with the generated dense 3D point coordinates, simplify the pose estimation process. Since these multi-scale features characterize the details and state of the object from rough shape to precise coordinates, they help to roughly identify the pose before it is finely estimated, forming a pre-estimation process.
Referring to fig. 5, fig. 5 is a schematic diagram of a point-edge convolution block according to an exemplary embodiment.
For the input multi-scale geometry supervision feature map F_edge_in ∈ R^(78×64×64), the feature point map is characterized as a point set X = {x_1, …, x_n} with n = 64×64 points, each of dimension F = 78.
A directed graph G = (V, E) representing the local point-cloud structure is computed, where V = {1, …, n} and E ⊆ V×V are the vertex set and the edge set, respectively. G is constructed as the k-nearest-neighbor (k-NN) graph of X.
Wherein the graph includes self-loops, which means that each node points to itself as well.
The edge feature is defined as e_ij = h_Θ(x_i, x_j), where h_Θ : R^F × R^F → R^F′ is a nonlinear function with a set of learnable parameters Θ.
The edge convolution operation is defined by applying a channel-wise symmetric aggregation operation (a summation or a maximum) over the edge features associated with all edges emanating from each vertex, for example
x′_i = max_{j:(i,j)∈E} h_Θ(x_i, x_j)
Here x_i is regarded as the center pixel and {x_j : (i, j) ∈ E} as its neighbors. Given an F-dimensional point cloud with n points, edge convolution produces an F′-dimensional point cloud with the same number of points.
Here x_1, …, x_n represent image pixels on a regular grid, and the graph G has connectivity representing fixed-size image blocks around each pixel.
h is defined as
h_Θ(x_i, x_j) = h_Θ(x_i, x_j − x_i, x_j)
This explicitly combines the global shape structure captured by the center coordinate x_i of the image block with the local neighborhood information captured by x_j − x_i and x_j. The operator is defined as
e′_ijm = ReLU(θ_m · x_i + φ_m · (x_j − x_i) + γ_m · x_j)
implemented as a shared MLP, while
x′_im = max_{j:(i,j)∈E} e′_ijm
Here Θ = (θ_1, …, θ_m, φ_1, …, φ_m, γ_1, …, γ_m).
A pairwise distance matrix is computed in the feature space, and the k nearest points are then taken for each point.
The whole Edge-PnP pose estimation module is defined and updated through the following procedure, and finally a fully connected (FC) network regresses the 6D pose result.
Notation: X = {x_1, …, x_n} denotes the point set, n the number of points, F the dimension of the feature vectors, and k the number of neighbors. For each point x_i, its k nearest neighbors are found, constructing the kNN graph G.
For each edge (x_i, x_j), its feature vector is computed from x_i, x_j and x_j − x_i, where x_i and x_j are the feature vectors of the two endpoints. This edge feature captures the relative position and orientation between the two points as well as the information of the points themselves.
For each point x_i, it is concatenated with the feature vectors of all its neighbors to obtain a matrix M_i = [x_i, x_j, x_j − x_i], i.e., the feature vectors e_ij of all edges connected to x_i.
For each point x_i, the function h is applied to its matrix M_i, and new feature vectors e′_ijm = MLP(M_i) are obtained through a multi-layer perceptron (MLP). The MLP is a fully connected neural network that can learn nonlinear transformations.
For each point x_i, max-pooling is applied over the new feature vectors of all its neighbors to obtain an aggregate feature vector g_i = maxpool(e′_ijm). Max-pooling is a dimensionality-reduction operation that extracts the most salient features.
For each point x_i, its own new feature vector and the aggregate feature vector are concatenated to obtain the final feature vector x′_i = [e′_ijm; g_i]. This preserves both the point's own information and its neighborhood information.
Therefore, processing the features with a point-edge graph convolution block that handles point data efficiently greatly improves the method's perception of the features and thus the prediction accuracy.
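A minimal PyTorch sketch of such a point-edge (EdgeConv-style) graph convolution block is given below; the choice of k, the channel widths and the batch-norm/ReLU arrangement are illustrative assumptions rather than the exact configuration described above.

    # Minimal EdgeConv-style point-edge graph convolution block (sketch).
    # Shapes, k, and channel widths are illustrative assumptions.
    import torch
    import torch.nn as nn

    def knn_graph(x, k):
        # x: (B, C, N) point features; returns indices of the k nearest neighbors, (B, N, k)
        inner = -2 * torch.matmul(x.transpose(1, 2), x)          # (B, N, N)
        xx = torch.sum(x ** 2, dim=1, keepdim=True)              # (B, 1, N)
        dist = -(xx.transpose(1, 2) + inner + xx)                # negative squared distance
        return dist.topk(k=k, dim=-1).indices                    # self-loops are included

    class PointEdgeConv(nn.Module):
        def __init__(self, in_ch, out_ch, k=16):
            super().__init__()
            self.k = k
            # shared MLP over concatenated [x_i, x_j - x_i, x_j] edge features
            self.mlp = nn.Sequential(
                nn.Conv2d(3 * in_ch, out_ch, kernel_size=1, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )

        def forward(self, x):
            # x: (B, C, N)
            B, C, N = x.shape
            idx = knn_graph(x, self.k)                                           # (B, N, k)
            idx = (idx + torch.arange(B, device=x.device).view(-1, 1, 1) * N).view(-1)
            feat = x.transpose(1, 2).contiguous().view(B * N, C)                 # flatten points
            neighbors = feat[idx].view(B, N, self.k, C)                          # x_j
            center = x.transpose(1, 2).unsqueeze(2).expand(-1, -1, self.k, -1)   # x_i
            edge = torch.cat([center, neighbors - center, neighbors], dim=-1)    # [x_i, x_j - x_i, x_j]
            edge = edge.permute(0, 3, 1, 2)                                      # (B, 3C, N, k)
            out = self.mlp(edge)                                                 # e'_ijm
            return out.max(dim=-1).values                                        # max over neighbors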
The fully end-to-end pose estimation network directly regresses the 6D pose information from the preprocessed image, and the end-to-end network is trained with the following loss function:
L = L_pose + L_rec + L_nor
where L_pose is the 6D pose loss; the ground-truth 6D poses used during training are obtained from a self-built dataset.
As described above, three variables T_s = (dx, dy, t_z) are regressed for the 3D offset, where dx and dy are the relative offsets from the detection-box center to the real object center and t_z is the scale-normalized depth.
The 3D translation in the predicted 6D pose is recovered from these relative offsets together with the focal lengths f_x, f_y of the monocular camera; a sketch of this recovery is given below.
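The sketch shows one standard pinhole back-projection consistent with the variables defined above; the exact parameterization (the use of the zoom-in ratio r and the principal point (c_x, c_y)) is an assumption, since the original recovery formula is not reproduced in this text.

    # Sketch: recovering the 3D translation from the regressed relative offsets.
    # The exact parameterization is an assumption in the spirit of the text, not the verbatim formula.
    def recover_translation(dx, dy, t_z, box, image_size, fx, fy, cx, cy):
        Cx, Cy, w, h = box              # detection-box center and size in the full image
        W, H = image_size
        r = max(W, H) / max(w, h)       # zoom-in ratio as defined in the text
        Ox = dx * w + Cx                # recovered projected object center (pixels)
        Oy = dy * h + Cy
        Tz = t_z * r                    # undo the scale normalization of the depth
        Tx = (Ox - cx) * Tz / fx        # pinhole back-projection with principal point (cx, cy)
        Ty = (Oy - cy) * Tz / fy
        return Tx, Ty, Tz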
The reconstruction loss includes: an L_1 loss supervising the reconstructed object dense point coordinates M_rec and the segmentation mask M_mask, and a cross-entropy loss CE supervising the surface block map M_region.
Only the pixel coordinate points belonging to the visible object inside the object image block are considered. The ground-truth pixels of the real object dense map are compared with the pixels I_i of the reconstructed object dense map, where M denotes the set of all object pixels; likewise, the pixels of the real mask map and the pixels I_seg,i of the predicted mask map are compared.
For the surface normal vector map, the cosine distance is used to measure the difference between the predicted surface normal vector map and the corresponding real object surface normal vector map, and an L_1 loss is used for the erosion mask map.
Here N_(x,y) and its ground-truth counterpart denote the predicted target object surface normal vector map and the real surface normal vector map at coordinates (x, y), respectively.
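A compact sketch of how such a combined reconstruction-and-normal loss might be assembled in PyTorch is shown below; the tensor keys, shapes and equal weighting are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def docking_losses(pred, gt):
        # pred/gt: dicts of tensors; keys, shapes and weights are illustrative assumptions.
        vis = gt["mask"].unsqueeze(1)                            # (B,1,H,W) visible object pixels
        # L1 loss on reconstructed dense point coordinates, visible pixels only
        l_points = (torch.abs(pred["dense_xyz"] - gt["dense_xyz"]) * vis).sum() / vis.sum().clamp(min=1)
        # L1 loss on the predicted segmentation mask
        l_mask = F.l1_loss(pred["mask"], gt["mask"])
        # cross-entropy loss on the surface block map (per-pixel region classification)
        l_region = F.cross_entropy(pred["region_logits"], gt["region_labels"])
        # cosine distance between predicted and real surface normal maps, visible pixels only
        cos = F.cosine_similarity(pred["normals"], gt["normals"], dim=1, eps=1e-6)  # (B,H,W)
        l_normal = ((1.0 - cos) * gt["mask"]).sum() / gt["mask"].sum().clamp(min=1)
        # L1 loss on the erosion mask
        l_ero = F.l1_loss(pred["erosion_mask"], gt["erosion_mask"])
        l_rec = l_points + l_mask + l_region
        l_nor = l_normal + l_ero
        return l_rec + l_nor                                     # L_pose is added by the caller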
The final regression output is the 6D pose information [R, T_x, T_y, T_z]. The 6D pose information is the position and orientation of the object relative to the camera, represented as a rotation matrix and a translation vector; it is converted into the pose result of the target relative to the AUV through the fixed transformation between the camera and the AUV body coordinate system.
Corresponding to the embodiment of the AUV near-end docking pose estimation method based on dense point reconstruction, the application also provides an embodiment of an AUV near-end docking pose estimation device based on dense point reconstruction.
Fig. 6 is a block diagram illustrating an AUV near-end docking pose estimation apparatus based on dense point reconstruction according to an exemplary embodiment. Referring to FIG. 6, the apparatus includes:
An acquisition unit 11 for acquiring an underwater image containing a docking target object through a built-in monocular camera of the AUV;
a preprocessing unit 12 for performing image preprocessing on the underwater image;
the pose estimation unit 13 is configured to input the preprocessed image into the near-end docking pose estimation deep learning model for pose estimation, where the near-end docking pose estimation deep learning model includes a target object detection module, a docking station internal target object dense point reconstruction module, a normal vector supervision module, and a graph convolution-based Edge-PnP pose estimation module; the target object detection module is configured to detect a target object in the preprocessed image and generate a target object image block; the docking station internal target object dense point reconstruction module is configured to extract features from the target object image block using an encoding-decoding network and reconstruct dense point coordinates of the target object; the normal vector supervision module is configured to recover a surface normal vector diagram of the object from the target object image block; and the Edge-PnP pose estimation module learns pose features from the dense point coordinates of the target object and the surface normal vector diagram of the object by using a graph convolutional network and estimates the final 6D pose.
The specific manner in which the various modules perform their operations in the apparatus of the above embodiments has been described in detail in connection with the method embodiments and will not be repeated here.
For the device embodiments, reference is made to the description of the method embodiments for the relevant points, since they essentially correspond to the method embodiments. The apparatus embodiments described above are merely illustrative: the components illustrated as separate may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present application. Those of ordinary skill in the art will understand and implement the invention without undue burden.
Correspondingly, the application also provides electronic equipment, which comprises: one or more processors; a memory for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for estimating an AUV near-end docking pose based on dense point reconstruction as described above.
Correspondingly, the application also provides a computer readable storage medium, on which computer instructions are stored, which when executed by a processor, implement the AUV near-end docking pose estimation method based on dense point reconstruction.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (9)

1. The AUV near-end docking monocular pose estimation method based on dense point reconstruction is characterized by comprising the following steps of:
acquiring an underwater image containing a docking station target object through a built-in monocular camera of the AUV;
performing image preprocessing on the underwater image;
inputting the preprocessed image into a near-end pose estimation deep learning model for pose estimation, wherein the near-end pose estimation deep learning model comprises a target object detection module, a docking station internal target object dense point reconstruction module, a normal vector supervision module and a graph convolution-based Edge-PnP pose estimation module; the target object detection module is used for detecting a target object in the preprocessed image and generating a target object image block; the docking station internal target object dense point reconstruction module is used for extracting features from the target object image block by adopting an encoding-decoding network and reconstructing dense point coordinates of the target object; the normal vector supervision module is used for recovering a surface normal vector diagram of the object from the target object image block; and the Edge-PnP pose estimation module learns pose features from the dense point coordinates of the target object and the surface normal vector diagram of the object by using a graph convolutional network and estimates the final 6D pose.
2. The method of claim 1, wherein image preprocessing the underwater image comprises:
and carrying out distortion correction on the underwater image to enable the image to show a correct shape, and then carrying out denoising treatment to obtain a clear underwater image.
3. The method of claim 1, wherein the object detection module is configured to detect an object in the preprocessed image and generate an object image block, and comprises:
and detecting the preprocessed underwater image by using a target detector YOLO, and cutting out a target object image block with a fixed size.
4. The method of claim 1, wherein the docking station internal target object dense point reconstruction module extracts features from the target object image block using an encoding-decoding network and reconstructs dense point coordinates of the target object, comprising:
and extracting features from the target object image block by using a coding network, decoding the extracted features, and supplementing and generating dense point coordinates by using multi-scale intermediate internal space features generated in the decoding and reconstruction process.
5. The method of claim 1, wherein the normal vector supervision module is configured to recover a surface normal vector map of an object from the target object image block, and comprises:
and extracting features from the target object image block by using the coding network, recovering the normal vector of the object surface by using the extracted features, and recovering the mask of the target image block in the recovery process.
6. The method of claim 5, wherein the Edge-PnP pose estimation module utilizing a graph convolutional network to learn pose features from dense point coordinates of the target object and a surface normal vector graph of the object and to estimate a final 6D pose comprises:
for the reconstructed dense 3D point coordinates of the target object and the normal vector diagram of the object surface, combining the 2D pixel point coordinates and the multi-scale intermediate internal spatial features of claim 5, and inputting them together into the Edge-PnP pose estimation module to extract features;
extracting point-by-point features by using point-edge convolution, and finally mapping to a 6D output by using a fully connected layer, i.e., estimating the 6D pose information;
wherein the 6D pose information is the position and orientation of the target relative to the monocular camera, represented as a rotation matrix and a translation vector; it is converted into the pose result of the target relative to the AUV through the fixed transformation between the camera and the AUV body coordinate system.
7. An AUV near-end docking monocular pose estimation device based on dense point reconstruction is characterized by comprising:
the acquisition unit is used for acquiring an underwater image containing a docking station target object through a built-in monocular camera of the AUV;
the preprocessing unit is used for preprocessing the underwater image;
the pose estimation unit is used for inputting the preprocessed image into a near-end docking pose estimation deep learning model to perform pose estimation, wherein the near-end docking pose estimation deep learning model comprises a target object detection module, a docking station internal target object dense point reconstruction module, a normal vector supervision module and a graph convolution-based Edge-PnP pose estimation module; the target object detection module is used for detecting a target object in the preprocessed image and generating a target object image block; the docking station internal target object dense point reconstruction module is used for extracting features from the target object image block by adopting an encoding-decoding network and reconstructing dense point coordinates of the target object; the normal vector supervision module is used for recovering a surface normal vector diagram of the object from the target object image block; and the Edge-PnP pose estimation module learns pose features from the dense point coordinates of the target object and the surface normal vector diagram of the object by using a graph convolutional network and estimates the final 6D pose.
8. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-6.
9. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method according to any of claims 1-6.
CN202310364180.0A 2023-04-07 2023-04-07 AUV near-end docking monocular pose estimation method and device based on dense point reconstruction Pending CN116524340A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310364180.0A CN116524340A (en) 2023-04-07 2023-04-07 AUV near-end docking monocular pose estimation method and device based on dense point reconstruction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310364180.0A CN116524340A (en) 2023-04-07 2023-04-07 AUV near-end docking monocular pose estimation method and device based on dense point reconstruction

Publications (1)

Publication Number Publication Date
CN116524340A true CN116524340A (en) 2023-08-01

Family

ID=87391230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310364180.0A Pending CN116524340A (en) 2023-04-07 2023-04-07 AUV near-end docking monocular pose estimation method and device based on dense point reconstruction

Country Status (1)

Country Link
CN (1) CN116524340A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117635679A (en) * 2023-12-05 2024-03-01 之江实验室 Curved surface efficient reconstruction method and device based on pre-training diffusion probability model
CN117635679B (en) * 2023-12-05 2024-05-28 之江实验室 Curved surface efficient reconstruction method and device based on pre-training diffusion probability model


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination