CN117274373A - Real-time monocular SLAM method and device represented by an implicit neural light field - Google Patents

Real-time monocular SLAM method and device represented by an implicit neural light field

Info

Publication number
CN117274373A
CN117274373A (application number CN202311307483.5A)
Authority
CN
China
Prior art keywords
image
camera
pose
correlation
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311307483.5A
Other languages
Chinese (zh)
Inventor
华璟
何雷
孙杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University filed Critical Zhejiang Gongshang University
Priority to CN202311307483.5A priority Critical patent/CN117274373A/en
Publication of CN117274373A publication Critical patent/CN117274373A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T 7/74 Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 17/05 Geographic models
    • G06T 7/20 Analysis of motion
    • G06T 7/269 Analysis of motion using gradient-based methods
    • G06T 7/50 Depth or shape recovery
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/10028 Range image; Depth image; 3D point clouds
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30244 Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Remote Sensing (AREA)
  • Computer Graphics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a real-time monocular SLAM method and device represented by an implicit neural light field. The invention processes camera localization and map construction in separate stages, realizing real-time SLAM with the map stored as an implicit neural light field under a monocular RGB camera. A fast reasoning model for estimating the camera pose and depth is pre-trained, solving the problem that a monocular RGB camera cannot provide localization and scene depth information for a SLAM system in the absence of a lidar and a gyroscope. Dense bundle adjustment based on optical-flow estimation effectively optimizes the camera poses and the positions of map points, improves the accuracy of pose estimation and the stability of depth estimation, and enhances the robustness of the system. The implicit neural light field is used to store and render the scene map, which, at lower storage cost, solves the problems of discontinuous scene representation, limited resolution, unsmooth viewpoint switching, and difficult illumination and texture modeling in conventional methods.

Description

Real-time monocular SLAM method and device represented by an implicit neural light field
Technical Field
The invention belongs to the technical field of simultaneous localization and mapping, and particularly relates to a real-time monocular SLAM method and device represented by an implicit neural light field.
Background
With the continued advancement of artificial intelligence (AI) technology and its widespread application, precise location information is urgently needed in a variety of fields, such as mobile robots, autonomous driving, unmanned aerial vehicles and augmented reality, to support accurate navigation and decision-making. However, the autonomous movement of these systems in complex, unknown environments faces the challenge of location awareness. Simultaneous localization and mapping (SLAM) is an important technology that meets this need.
In environments full of unknown and changing factors, conventional SLAM techniques enable these autonomous systems to locate their own position immediately and accurately by integrating data from various sensors, such as lidar, cameras and inertial measurement units, while simultaneously building a three-dimensional map of the surrounding environment. With such highly autonomous position sensing, mobile robots, unmanned aerial vehicles and other systems can plan paths, avoid obstacles and execute tasks more accurately.
Conventional SLAM based on lidar sensors incurs significant cost and resource consumption in practice. The high price of lidar sensors, together with the complex data processing and computational resources required, limits the wide application of this technology.
In addition, conventional SLAM methods often employ point clouds or sparse feature points to outline the scene; however, such a representation has difficulty maintaining continuity and high resolution. On the one hand, these methods find it very difficult to achieve seamless and smooth viewpoint switching, i.e. it is hard to make accurate predictions for regions not directly captured by the sensor, which is particularly critical in applications involving hidden parts of the scene or multi-angle viewing. On the other hand, in the visual presentation of conventional SLAM methods, the ability to reproduce scene detail, texture and illumination changes is relatively weak, which may lead to an insufficiently vivid representation of the final image. Meanwhile, since the perception of the scene is mainly concentrated on specific feature points or sparse point clouds, these methods often cannot comprehensively represent the features of the whole scene.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides a real-time monocular SLAM method and device represented by an implicit neural light field.
The invention aims at realizing the following technical scheme:
according to a first aspect of the present specification, there is provided a real-time monocular SLAM method represented by an implicit neural light field, comprising the steps of:
Step S1, pre-training a fast reasoning network for estimating the pose and the inverse depth map of a camera from RGB images in a supervised learning mode by utilizing a video stream data set with the pose and the depth information of the camera, wherein the trained fast reasoning network comprises a positioning thread;
Step S2, feeding the RGB image in real time to the fast reasoning network trained in step S1, so as to acquire and maintain two state variables of the RGB image: the camera pose T_t and the inverse depth map D_t;

Step S3, based on the back-projection model, mapping the camera pose T_t and the inverse depth map D_t of the RGB image obtained in step S2, in combination with the known camera intrinsic parameters, into three-dimensional spatial coordinates (X, Y, Z);

For an RGB image I_t, the position relative to the camera of the three-dimensional voxel corresponding to a pixel (u, v) is obtained through the back-projection model and the depth value d_{u,v}; the coordinates are calculated as:

$$X = \frac{(u - C_x)\,Z}{F_x}, \qquad Y = \frac{(v - C_y)\,Z}{F_y}, \qquad Z = d_{u,v}$$

where the depth value d_{u,v} is the reciprocal of the inverse depth of the corresponding pixel in the inverse depth map obtained in step S2;

(u, v) are the coordinates of a pixel in image I_t, namely the horizontal and vertical coordinates of the image pixel; (C_x, C_y) are the center coordinates of the image, C_x and C_y being the horizontal and vertical coordinates of the image center; (F_x, F_y) are the focal lengths of the camera intrinsics, F_x and F_y being the horizontal and vertical focal lengths of the camera lens; and (X, Y, Z) are the coordinates of the voxel relative to the camera position along the X, Y and Z axes, respectively;

The ray direction vector (x_d, y_d, z_d) of the voxel corresponding to the pixel (u, v) is then calculated from the coordinates (X, Y, Z) and the current camera pose T_t;

Multiple coordinate points are sampled along the ray-direction path according to a Gaussian distribution centered at the three-dimensional coordinates (X, Y, Z) corresponding to the image pixel, and these points are concatenated into a multi-coordinate direction vector carrying the ray direction information;

Step S4, constructing a fully-connected network of 80 to 100 layers, wherein the fully-connected network takes the multi-coordinate direction vector obtained in step S3 as input, outputs the RGB predicted values of the corresponding pixels, and assembles them back to the same size as the original image;

Step S5, taking the error between the RGB ground-truth values and the RGB predicted values assembled to the original image size in step S4 as the loss function, so as to train the fully-connected network online in a supervised manner, wherein the fully-connected network trained in real time comprises a map construction thread and a display thread;

Step S6, integrating the fast reasoning network trained in step S1 and the fully-connected network trained in real time in step S5 in a multithreaded manner to form a complete SLAM system, wherein the SLAM system comprises the positioning thread, the map construction thread and the display thread.
Further, the step S1 specifically includes the following:
Step S11, extracting image features and context features from the images of the video stream data set to obtain dense features;

Step S12, calculating the correlation volumes between all pixel pairs of two adjacent frames based on the image features in the dense features to describe the visual similarity of the pixel pairs, thereby obtaining a 4-dimensional tensor C_ij storing the correlations of all pixel pairs between the two adjacent frames, i.e. a correlation feature map;

Step S13, constructing a multi-layer correlation pyramid with different scales based on the correlation feature map obtained in step S12, and looking up the 1-dimensional feature vector x_t of the required pixel pairs from the multi-layer correlation pyramid;

Step S14, constructing an update operator based on the context features obtained in step S11, the 1-dimensional feature vector x_t obtained in step S13, and the camera pose T_t and inverse depth map D_t corresponding to the previous frame, so as to dynamically update the adjustment of the camera pose and the correction of the depth map, wherein the inverse depth values of the inverse depth map are the reciprocals of the depth values of the depth map, and the camera pose T_t ∈ SE(3); SE(3) denotes the rigid-body transformations in three-dimensional Euclidean space, i.e. the 6 degrees of freedom of position and rotation; the inverse depth map D_t is a positive real 2-dimensional matrix of size H×W;
step S15, the pose loss and the depth loss are combined to supervise and train the fast reasoning network.
Further, the step S11 specifically includes the following:
Step S11a, two adjacent frames I_i and I_j of the video stream data set are fed in turn to the feature encoder g_θ, which extracts image features at 1/8 resolution, i.e. g_θ: R^{H×W×3} → R^{H/8×W/8×D}, to obtain the image features; the feature encoder g_θ consists of a convolutional neural network and several residual blocks, where D = 256 is taken, and H and W denote the height and width of the image in pixels;

Step S11b, image I_i is fed to the context encoder h_θ, whose structure is the same as that of the feature encoder g_θ, to obtain the context features.
Further, the step S12 specifically includes the following:
The correlation volume of the two frames, C_ij, a real 4-dimensional tensor of size H×W×H×W, is computed from the image features in the extracted dense features;

The correlation of a pair of pixels in the correlation volume C_ij is calculated as:

$$C_{ij}(u_1, v_1, u_2, v_2) = \left\langle\, g_\theta(I_i)(u_1, v_1),\; g_\theta(I_j)(u_2, v_2) \,\right\rangle$$

where C_ij(u_1, v_1, u_2, v_2) is the correlation between pixel (u_1, v_1) in I_i and pixel (u_2, v_2) in I_j, computed as the inner product of the corresponding feature vectors, and u and v are the horizontal and vertical pixel coordinates, respectively.
Further, the step S13 specifically includes the following:
Step S13a, for the correlation volume C_ij of two adjacent frames I_i and I_j, C_ij is processed with average pooling operations of kernel sizes 2, 4 and 8, reducing the resolution of the correlation volume to 1/2, 1/4 and 1/8 of the original and thus forming a pyramid structure;

Step S13b, using the estimated current camera pose and inverse depth, the pixel coordinates P_i in image I_i are mapped to the pixel coordinates P_ij in the next frame I_j, and the corresponding feature vectors are dynamically looked up from the correlation volume C_ij according to P_ij. To achieve this, a correlation lookup operator L_r is introduced: it takes a coordinate grid of size H×W as input, retrieves the correlation volume from each layer of the correlation pyramid by bilinear sampling over a grid of radius r, and concatenates the lookup results of all layers into a final feature vector x_t that characterizes the pixels, realizing multi-scale, hierarchical correlation features between pixels; at the same time, the correlation feature vector x_t obtained in this way also contains visual similarity information from the neighborhood of P_ij;

The pixel coordinates P_i in image I_i are mapped to the pixel coordinates P_ij in the next frame I_j according to:

$$P_{ij} = \Pi_c\!\left( T_j T_i^{-1} \circ \Pi_c^{-1}(P_i, d_i) \right)$$

where Π_c is the camera model projecting three-dimensional coordinate points onto the camera image; Π_c^{-1} is the inverse-projection function, mapping the inverse depth d_i and the coordinates P_i to a three-dimensional point cloud; T_i is the camera pose corresponding to image I_i, and T_j is the predicted camera pose corresponding to image I_j.
Further, the step S14 specifically includes the following:
The update operator for updating the camera pose T_t and the inverse depth map D_t mainly comprises two core modules, ConvGRU and DBA, specifically:
a) ConvGRU module
The ConvGRU module combines a convolutional neural network (CNN) with a gated recurrent unit (GRU) and takes both the temporal and the spatial dimension into account, so as to iteratively update the optical flow P_ij. The module takes two inputs: first, the correlation feature vector x_t retrieved from the layers of the correlation pyramid of C_ij and concatenated; second, the context feature vector h_0 obtained by applying an additional convolution to the context feature map extracted from image I_i. Finally, the module maps its hidden state, through an additional convolution layer, to a corrected optical-flow field ΔP_ij and an associated confidence map w_ij, which are used to compute the corrected pixel mapping P*_ij and its confidence map;

The corrected pixel mapping P*_ij is calculated as:

$$P^{*}_{ij} = P_{ij} + \Delta P_{ij}$$

and its confidence map w_ij is output directly by the additional convolution layer.
b) Dense bundle adjustment (DBA) module

The internal and external camera parameters and the spatial positions of the feature points in the three-dimensional scene are adjusted simultaneously, so as to minimize the error between the projected positions on the image of feature points observed from multiple viewpoints and their actually detected positions, thereby optimally maintaining the camera poses T_i and the spatial positions of the feature points; that is, the DBA module maps the corrected optical-flow field P*_ij to update quantities of the camera pose and the inverse depth;

The error between the projected position of a feature point on the image and its actually detected position is minimized; at the pixel level the objective is:

$$L(T^{*}, d^{*}) = \sum_{(i,j)} \left\| P^{*}_{ij} - \Pi_c\!\left( T_j T_i^{-1} \circ \Pi_c^{-1}(P_i, d_i) \right) \right\|^{2}_{\Sigma_{ij}}$$

where ΔT_ij and Δd_i are the update quantities of the camera pose and the inverse depth, respectively; ‖·‖_Σ is the Mahalanobis distance, which weights the error term according to the confidence weights w_ij, with Σ_ij = diag(w_ij);
In the solution process using the Gauss-Newton algorithm, the sparse structure of the Hessian matrix H allows Schur elimination to be used to accelerate the computation, so the incremental linear equation HΔx = b can be written in the following block form:

$$\begin{bmatrix} G & E \\ E^{T} & D \end{bmatrix} \begin{bmatrix} \Delta\xi \\ \Delta d \end{bmatrix} = \begin{bmatrix} v \\ w \end{bmatrix}$$

where H is the Hessian matrix, H = J^T J, and J is the Jacobian of the image error L(T*, d*); G is the block camera matrix containing the camera-pose-related information; D is the diagonal block matrix corresponding to the pixels, containing the inverse-depth-related information of the pixels, and its size is much larger than that of G; E is an off-diagonal block matrix related to the specific observations; Δξ is the update of the camera pose Lie algebra; and Δd is the update of the inverse depth of each pixel;
Solving for Δξ in this linear system by Gaussian elimination yields the incremental equation for the camera-pose part:

$$(H/D)\,\Delta\xi = \left[ G - E D^{-1} E^{T} \right] \Delta\xi = v - E D^{-1} w$$

where H/D is the Schur complement of the Hessian matrix H with respect to the pixel block matrix D and can be regarded as a reduced camera matrix;

Substituting the solved Δξ back into the original incremental linear system, the inverse-depth increment Δd of the pixels is calculated as:

$$\Delta\xi = \left[ G - E D^{-1} E^{T} \right]^{-1} \left( v - E D^{-1} w \right)$$

$$\Delta d = D^{-1} \left( w - E^{T} \Delta\xi \right)$$
according to a second aspect of the present specification, there is provided a real-time monocular SLAM device represented by an implicit neural light field, comprising a memory and one or more processors, the memory having executable code stored therein, the processors, when executing the executable code, being adapted to implement the real-time monocular SLAM method represented by an implicit neural light field.
The beneficial effects of the invention are as follows:
1. By constructing and training an estimation model, the invention realizes real-time inference of the corresponding camera pose and image depth from RGB images, solving the cost problem caused by the need for lidar as sensing equipment in conventional SLAM;
2. The invention adopts a correlation volume pyramid structure to represent the correlation volume of pixels between two frames, attending to large-amplitude motion without losing information about small-amplitude motion, thereby improving the accuracy of camera pose prediction;
3. The ConvGRU module is combined with the DBA module to realize optical-flow estimation, and the camera pose and image depth are estimated from the estimated optical flow, which reduces the cost of expensive sensors such as lidar required by SLAM and widens the application range of SLAM technology;
4. The invention applies the implicit neural light field to SLAM, improving the ability of SLAM to express the details, textures and rays of the constructed scene map; dense bundle adjustment based on optical-flow estimation effectively optimizes the camera poses and the positions of map points, improves the accuracy of pose estimation and the stability of depth estimation, alleviates the problems caused by camera motion blur and fast motion, and enhances the robustness of the system; the implicit neural light field is used to store and render the scene map, which, at lower storage cost, solves the problems of discontinuous scene representation, limited resolution, unsmooth viewpoint switching, and difficult illumination and texture modeling in conventional methods.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a real-time monocular SLAM method represented by an implicit neural light field, provided by an exemplary embodiment.
FIG. 2 is a flow chart for reasoning camera pose and depth in real time from acquired RGB images and storing a scene map using a NeLF network, as provided by an example embodiment;
FIG. 3 is a user perspective scene map rendering flow chart provided by an exemplary embodiment;
FIG. 4 is a block diagram of a real-time monocular SLAM device represented by an implicit neural light field, provided by an exemplary embodiment.
Detailed Description
For a better understanding of the technical solutions of the present application, embodiments of the present application are described in detail below with reference to the accompanying drawings.
It should be understood that the described embodiments are merely some, but not all, of the embodiments of the present application. All other embodiments, based on the embodiments herein, which would be apparent to one of ordinary skill in the art without making any inventive effort, are intended to be within the scope of the present application.
As shown in fig. 1, in one embodiment, a real-time monocular SLAM method is provided that is represented by an implicit neural light field, comprising the steps of:
step S1, pre-training a fast reasoning network for estimating the pose and the inverse depth map of the camera from RGB images in a supervised learning mode by utilizing a video stream data set with the pose and the depth information of the camera, wherein the trained fast reasoning network comprises a positioning thread;
in one embodiment, the step S1 specifically includes the following:
Step S11, extracting image features and context features from the images of the video stream data set to obtain dense features;
specifically, step S11 specifically includes the following:
Step S11a, two adjacent frames I_i and I_j of the video stream data set are fed in turn to the feature encoder g_θ, which extracts image features at 1/8 resolution, i.e. g_θ: R^{H×W×3} → R^{H/8×W/8×D}, to obtain the image features; the feature encoder g_θ consists of a convolutional neural network and several residual blocks, where D = 256 is taken, and H and W denote the height and width of the image in pixels; in one embodiment, the feature encoder g_θ consists of one Conv7x7(64) convolutional layer, 6 residual blocks and 3 downsampling layers;

Step S11b, image I_i is fed to the context encoder h_θ, whose structure is the same as that of the feature encoder g_θ, to obtain the context features.
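As an illustration of steps S11a and S11b, the following is a minimal PyTorch-style sketch of such a 1/8-resolution encoder; only the Conv7x7(64) stem, the six residual blocks, the three 2x downsampling stages and the D = 256 output dimension come from the text above, while the intermediate channel widths are assumptions.

```python
# Hypothetical sketch of the feature encoder g_theta; the context encoder
# h_theta shares the same structure but separate weights.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 projection on the skip path when the shape changes (downsampling)
        self.skip = (nn.Conv2d(in_ch, out_ch, 1, stride=stride)
                     if stride != 1 or in_ch != out_ch else nn.Identity())

    def forward(self, x):
        y = self.conv2(self.relu(self.conv1(x)))
        return self.relu(y + self.skip(x))

class FeatureEncoder(nn.Module):
    """Maps an RGB image (B, 3, H, W) to dense features (B, 256, H/8, W/8)."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3),   # Conv7x7(64), 1/2 resolution
            nn.ReLU(inplace=True))
        self.blocks = nn.Sequential(                    # 6 residual blocks in total
            ResidualBlock(64, 64), ResidualBlock(64, 64),
            ResidualBlock(64, 96, stride=2), ResidualBlock(96, 96),    # 1/4
            ResidualBlock(96, 128, stride=2), ResidualBlock(128, 128)) # 1/8
        self.head = nn.Conv2d(128, out_dim, 1)

    def forward(self, image):
        return self.head(self.blocks(self.stem(image)))

g_theta, h_theta = FeatureEncoder(), FeatureEncoder()   # feature and context encoders
```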
Step S12, calculating the correlation volumes between all pixel pairs of two adjacent frames based on the image features in the dense features to describe the visual similarity of the pixel pairs, thereby obtaining a 4-dimensional tensor C_ij storing the correlations of all pixel pairs between the two adjacent frames, i.e. a correlation feature map;
specifically, step S12 specifically includes the following:
The correlation volume of the two frames, C_ij, a real 4-dimensional tensor of size H×W×H×W, is computed from the image features in the extracted dense features;

The correlation of a pair of pixels in the correlation volume C_ij is calculated as:

$$C_{ij}(u_1, v_1, u_2, v_2) = \left\langle\, g_\theta(I_i)(u_1, v_1),\; g_\theta(I_j)(u_2, v_2) \,\right\rangle$$

where C_ij(u_1, v_1, u_2, v_2) is the correlation between pixel (u_1, v_1) in I_i and pixel (u_2, v_2) in I_j, computed as the inner product of the corresponding feature vectors, and u and v are the horizontal and vertical pixel coordinates, respectively.
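As a concrete illustration of the correlation formula above, the following sketch computes the all-pairs correlation volume as inner products of the two feature maps; the scaling by the square root of the feature dimension is an added numerical-stability assumption, not something stated in the text.

```python
import torch

def correlation_volume(fmap_i: torch.Tensor, fmap_j: torch.Tensor) -> torch.Tensor:
    """All-pairs correlation between two feature maps of shape (D, H, W).

    Returns a 4-D tensor of shape (H, W, H, W) with
    C_ij[u1, v1, u2, v2] = <g_theta(I_i)(u1, v1), g_theta(I_j)(u2, v2)>.
    """
    D, H, W = fmap_i.shape
    fi = fmap_i.reshape(D, H * W)      # (D, HW) feature vectors of I_i
    fj = fmap_j.reshape(D, H * W)      # (D, HW) feature vectors of I_j
    corr = fi.t() @ fj                 # (HW, HW) matrix of inner products
    corr = corr / D ** 0.5             # scaling assumption for numerical stability
    return corr.reshape(H, W, H, W)
```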
Step S13, constructing a multi-layer correlation pyramid with different scales based on the correlation feature map obtained in step S12, and looking up the 1-dimensional feature vector x_t of the required pixel pairs from the multi-layer correlation pyramid;
Specifically, step S13 specifically includes the following:
Step S13a, the 4-dimensional correlation volume C_ij is pooled 3 times to build a correlation pyramid and obtain correlation information at different resolutions. For the correlation volume C_ij of two adjacent frames I_i and I_j, C_ij is processed with average pooling operations of kernel sizes 2, 4 and 8, reducing the resolution of the correlation volume to 1/2, 1/4 and 1/8 of the original and thus forming a pyramid structure. This helps capture feature information at different scales and improves the perception of multi-scale features; through multi-scale fusion, dynamic changes are understood more comprehensively, e.g. the high-resolution levels are more sensitive to small motions while the low-resolution levels focus more on larger motions;
Step S13b, using the estimated current camera pose and inverse depth, the pixel coordinates P_i in image I_i are mapped to the pixel coordinates P_ij in the next frame I_j, and the corresponding feature vectors are dynamically looked up from the correlation volume C_ij according to P_ij. To achieve this, a correlation lookup operator L_r is introduced: it takes a coordinate grid of size H×W as input, retrieves the correlation volume from each layer of the correlation pyramid by bilinear sampling over a grid of radius r, and concatenates the lookup results of all layers into a final feature vector x_t that characterizes the pixels, realizing multi-scale, hierarchical correlation features between pixels; at the same time, the correlation feature vector x_t obtained in this way also contains visual similarity information from the neighborhood of P_ij;
The pixel coordinates P_i in image I_i are mapped to the pixel coordinates P_ij in the next frame I_j according to:

$$P_{ij} = \Pi_c\!\left( T_j T_i^{-1} \circ \Pi_c^{-1}(P_i, d_i) \right)$$

where Π_c is the camera model projecting three-dimensional coordinate points onto the camera image; Π_c^{-1} is the inverse-projection function, mapping the inverse depth d_i and the coordinates P_i to a three-dimensional point cloud; T_i is the camera pose corresponding to image I_i, and T_j is the predicted camera pose corresponding to image I_j.
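The sketch below illustrates the correlation pyramid of step S13a and the lookup operator L_r of step S13b; the grid_sample-based bilinear sampling, the default radius and the tensor layout are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def build_pyramid(corr: torch.Tensor, num_levels: int = 4) -> list:
    """corr: (H*W, 1, H, W), the correlation volume reshaped so that the last
    two dimensions (the pixels of I_j) can be pooled; level 0 is the full
    resolution and each further level halves it (kernels 2, 4, 8 overall)."""
    pyramid = [corr]
    for _ in range(num_levels - 1):
        corr = F.avg_pool2d(corr, kernel_size=2, stride=2)
        pyramid.append(corr)
    return pyramid

def lookup(pyramid: list, coords: torch.Tensor, radius: int = 3) -> torch.Tensor:
    """coords: (H*W, 2) correspondences P_ij in (x, y) pixel units of I_j.
    Samples a (2r+1)x(2r+1) window around P_ij at every level by bilinear
    interpolation and concatenates the results into the feature vector x_t."""
    dx = torch.arange(-radius, radius + 1, dtype=coords.dtype)
    delta = torch.stack(torch.meshgrid(dx, dx, indexing="xy"), dim=-1)  # (2r+1, 2r+1, 2)
    out = []
    for lvl, corr in enumerate(pyramid):
        _, _, h, w = corr.shape
        centers = coords / (2 ** lvl)
        grid = centers[:, None, None, :] + delta[None]                  # (HW, 2r+1, 2r+1, 2)
        grid = 2 * grid / torch.tensor([w - 1, h - 1], dtype=grid.dtype) - 1
        sampled = F.grid_sample(corr, grid, align_corners=True)         # (HW, 1, 2r+1, 2r+1)
        out.append(sampled.flatten(start_dim=1))
    return torch.cat(out, dim=-1)                                       # x_t per pixel
```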
Step S14, constructing an update operator based on the context features obtained in step S11, the 1-dimensional feature vector x_t obtained in step S13, and the camera pose T_t and inverse depth map D_t corresponding to the previous frame, so as to dynamically update the adjustment of the camera pose and the correction of the depth map, wherein the inverse depth values of the inverse depth map are the reciprocals of the depth values of the depth map, and the camera pose T_t ∈ SE(3); SE(3) denotes the rigid-body transformations in three-dimensional Euclidean space, i.e. the 6 degrees of freedom of position and rotation; the inverse depth map D_t is a positive real 2-dimensional matrix of size H×W;
specifically, step S14 specifically includes the following:
The update operator for updating the camera pose T_t and the inverse depth map D_t mainly comprises two core modules, ConvGRU (convolutional gated recurrent unit) and DBA, specifically:
a) ConvGRU module
The ConvGRU module combines a convolutional neural network (CNN) with a gated recurrent unit (GRU) and takes both the temporal and the spatial dimension into account, so as to iteratively update the optical flow P_ij. The module takes two inputs: first, the correlation feature vector x_t retrieved from the layers of the correlation pyramid of C_ij and concatenated; second, the context feature vector h_0 obtained by applying an additional convolution to the context feature map extracted from image I_i. Finally, the module maps its hidden state, through an additional convolution layer, to a corrected optical-flow field ΔP_ij and an associated confidence map w_ij, which are used to compute the corrected pixel mapping P*_ij and its confidence map;

The corrected pixel mapping P*_ij is calculated as:

$$P^{*}_{ij} = P_{ij} + \Delta P_{ij}$$

and its confidence map w_ij is output directly by the additional convolution layer.

The gating activations of the GRU cell of this module are calculated as:

$$z_t = \sigma\!\left(\mathrm{Conv}_{3\times3}([h_{t-1}, x_t], W_z)\right)$$

$$r_t = \sigma\!\left(\mathrm{Conv}_{3\times3}([h_{t-1}, x_t], W_r)\right)$$

where z_t and r_t are intermediate variables (the update and reset gates) of the GRU cell, W denotes the parameters of the corresponding 3×3 convolution layer, h_t is the hidden vector of the GRU, and the initial state h_0 is the extracted image context feature;
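A minimal ConvGRU cell matching the gate equations above is sketched below; the candidate-state and hidden-state update formulas are the standard GRU ones and are assumptions here, since only the two gates are written out in the text.

```python
import torch
import torch.nn as nn

class ConvGRU(nn.Module):
    """Convolutional GRU cell of the update operator.

    h: hidden state (B, C_h, H/8, W/8), initialized from the context features h_0.
    x: input features (B, C_x, H/8, W/8), e.g. the correlation vector x_t
       concatenated with the current flow estimate.
    """
    def __init__(self, hidden_dim: int = 128, input_dim: int = 128):
        super().__init__()
        ch = hidden_dim + input_dim
        self.convz = nn.Conv2d(ch, hidden_dim, 3, padding=1)   # update gate z_t
        self.convr = nn.Conv2d(ch, hidden_dim, 3, padding=1)   # reset gate r_t
        self.convq = nn.Conv2d(ch, hidden_dim, 3, padding=1)   # candidate state (assumed)

    def forward(self, h: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))                      # z_t = sigma(Conv3x3([h_{t-1}, x_t], W_z))
        r = torch.sigmoid(self.convr(hx))                      # r_t = sigma(Conv3x3([h_{t-1}, x_t], W_r))
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))  # standard GRU candidate (assumption)
        return (1 - z) * h + z * q                             # new hidden state h_t
```

The hidden state h_t would then be passed through additional convolution heads (not shown) to produce the flow correction ΔP_ij and the confidence map w_ij described above.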
b) Dense bundle adjustment (DBA) module

The internal and external camera parameters and the spatial positions of the feature points in the three-dimensional scene are adjusted simultaneously, so as to minimize the error between the projected positions on the image of feature points observed from multiple viewpoints and their actually detected positions, thereby optimally maintaining the camera poses T_i and the spatial positions of the feature points; that is, the DBA module maps the corrected optical-flow field P*_ij to update quantities of the camera pose and the inverse depth;

The error between the projected position of a feature point on the image and its actually detected position is minimized; at the pixel level the objective is:

$$L(T^{*}, d^{*}) = \sum_{(i,j)} \left\| P^{*}_{ij} - \Pi_c\!\left( T_j T_i^{-1} \circ \Pi_c^{-1}(P_i, d_i) \right) \right\|^{2}_{\Sigma_{ij}}$$

where ΔT_ij and Δd_i are the update quantities of the camera pose and the inverse depth, respectively; ‖·‖_Σ is the Mahalanobis distance, which weights the error term according to the confidence weights w_ij, with Σ_ij = diag(w_ij);

In the solution process using the Gauss-Newton algorithm, the sparse structure of the Hessian matrix H allows Schur elimination to be used to accelerate the computation, so the incremental linear equation HΔx = b can be written in the following block form:

$$\begin{bmatrix} G & E \\ E^{T} & D \end{bmatrix} \begin{bmatrix} \Delta\xi \\ \Delta d \end{bmatrix} = \begin{bmatrix} v \\ w \end{bmatrix}$$

where H is the Hessian matrix, H = J^T J, and J is the Jacobian of the image error L(T*, d*); G is the block camera matrix containing the camera-pose-related information; D is the diagonal block matrix corresponding to the pixels, containing the inverse-depth-related information of the pixels, and its size is much larger than that of G; E is an off-diagonal block matrix related to the specific observations; Δξ is the update of the camera pose Lie algebra; and Δd is the update of the inverse depth of each pixel;
Solving for Δξ in this linear system by Gaussian elimination, i.e. eliminating Δd by block elimination on both sides of the incremental linear system, yields the incremental equation for the camera-pose part:

$$(H/D)\,\Delta\xi = \left[ G - E D^{-1} E^{T} \right] \Delta\xi = v - E D^{-1} w$$

where H/D is the Schur complement of the Hessian matrix H with respect to the pixel block matrix D and can be regarded as a reduced camera matrix;

Substituting the solved Δξ back into the original incremental linear system, the inverse-depth increment Δd of the pixels is calculated as:

$$\Delta\xi = \left[ G - E D^{-1} E^{T} \right]^{-1} \left( v - E D^{-1} w \right)$$

$$\Delta d = D^{-1} \left( w - E^{T} \Delta\xi \right)$$
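A compact numerical sketch of the Schur-complement solve described above, treating G, E, D, v and w as already-assembled dense blocks; in practice D is (block-)diagonal and the system is solved sparsely, so the dense inversion here is purely illustrative.

```python
import numpy as np

def dba_schur_solve(G, E, D, v, w):
    """Solve [[G, E], [E^T, D]] [dxi; dd] = [v; w] via the Schur complement.

    Returns the camera-pose (Lie algebra) update dxi and the per-pixel
    inverse-depth update dd.
    """
    D_inv = np.linalg.inv(D)             # cheap in practice because D is diagonal
    S = G - E @ D_inv @ E.T              # reduced camera matrix H/D
    dxi = np.linalg.solve(S, v - E @ D_inv @ w)
    dd = D_inv @ (w - E.T @ dxi)         # back-substitution for the inverse depth
    return dxi, dd

# tiny usage example with a well-conditioned random system
rng = np.random.default_rng(0)
G = 4.0 * np.eye(6)
E = 0.1 * rng.standard_normal((6, 10))
D = np.diag(rng.uniform(1.0, 2.0, 10))
dxi, dd = dba_schur_solve(G, E, D, rng.standard_normal(6), rng.standard_normal(10))
```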
step S15, the pose loss and the depth loss are combined to monitor and train the fast reasoning network.
Specifically, step S15 specifically includes the following:
The network model is trained under the supervision of a pose loss and an optical-flow loss, and the parameters are updated by back-propagation with the Adam algorithm. The loss function of the model for predicting camera pose and image depth combines an optical-flow loss, which includes the depth information loss, with a camera-pose loss computed between the ground-truth poses T_i and the predicted poses.
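An illustrative sketch of such a combined supervision loss is given below; the specific pose-error metric and the weighting coefficient are assumptions, since the text only states that the pose loss and the optical-flow loss are combined and optimized with Adam.

```python
import torch

def flow_loss(pred_flow, gt_flow):
    # L1 error between predicted and ground-truth pixel correspondences
    return (pred_flow - gt_flow).abs().mean()

def pose_loss(pred_pose, gt_pose):
    # pred_pose, gt_pose: (B, 4, 4) homogeneous camera poses. The error of the
    # relative transform is measured on its rotation and translation parts
    # (a simple stand-in for a geodesic SE(3) distance).
    rel = torch.linalg.inv(gt_pose) @ pred_pose
    eye = torch.eye(3, device=rel.device, dtype=rel.dtype)
    rot_err = (rel[:, :3, :3] - eye).norm(dim=(1, 2))
    trans_err = rel[:, :3, 3].norm(dim=1)
    return (rot_err + trans_err).mean()

def total_loss(pred_flow, gt_flow, pred_pose, gt_pose, lambda_pose=10.0):
    # lambda_pose is an assumed weighting between the two terms
    return flow_loss(pred_flow, gt_flow) + lambda_pose * pose_loss(pred_pose, gt_pose)
```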
Step S2, feeding the RGB image in real time to the fast reasoning network trained in step S1, so as to acquire and maintain two state variables of the RGB image: the camera pose T_t and the inverse depth map D_t;

Step S3, a viewing angle is set, and the picture of the scene under that viewing angle, i.e. an RGB image, is output. The RGB value of a pixel in the RGB image is obtained as follows:

Based on the back-projection model, the camera pose T_t and the inverse depth map D_t of the RGB image obtained in step S2 are mapped, in combination with the known camera intrinsic parameters, into three-dimensional spatial coordinates (X, Y, Z);

For an RGB image I_t, the position relative to the camera of the three-dimensional voxel corresponding to a pixel (u, v) is obtained through the back-projection model and the depth value d_{u,v}; the coordinates are calculated as:

$$X = \frac{(u - C_x)\,Z}{F_x}, \qquad Y = \frac{(v - C_y)\,Z}{F_y}, \qquad Z = d_{u,v}$$

where the depth value d_{u,v} is the reciprocal of the inverse depth of the corresponding pixel in the inverse depth map obtained in step S2; the inverse depth, being the reciprocal of the depth value, serves to enhance the stability of depth estimation, resist occlusion and assist depth initialization;

(u, v) are the coordinates of a pixel in image I_t, namely the horizontal and vertical coordinates of the image pixel; (C_x, C_y) are the center coordinates of the image, C_x and C_y being the horizontal and vertical coordinates of the image center; (F_x, F_y) are the focal lengths of the camera intrinsics, F_x and F_y being the horizontal and vertical focal lengths of the camera lens; and (X, Y, Z) are the coordinates of the voxel relative to the camera position along the X, Y and Z axes, respectively;

The ray direction vector (x_d, y_d, z_d) of the voxel corresponding to the pixel (u, v) is then calculated from the coordinates (X, Y, Z) and the current camera pose T_t;

In theory, each ray can be uniquely determined by a direction vector in three-dimensional space. In practice, however, issues such as object boundaries in the image mean that using the direction vector alone as input can cause drastic output fluctuations for tiny input changes. Therefore, the input is enriched by sampling multiple coordinate points along the ray-direction path according to a Gaussian distribution centered at the three-dimensional coordinates (X, Y, Z) corresponding to the image pixel, and concatenating these points into a multi-coordinate direction vector carrying the ray direction information.
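A short sketch of the back-projection and Gaussian ray sampling described above; the number of samples and the standard deviation of the Gaussian are assumptions.

```python
import numpy as np

def backproject(u, v, inv_depth, fx, fy, cx, cy):
    """Pixel (u, v) with inverse depth -> voxel coordinates (X, Y, Z) relative
    to the camera, using the pinhole back-projection model."""
    z = 1.0 / inv_depth                 # depth is the reciprocal of the inverse depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

def sample_ray_points(point_cam, cam_pose, n_samples=8, sigma=0.05):
    """Build the multi-coordinate direction vector for one pixel.

    point_cam: (3,) voxel position in camera coordinates.
    cam_pose:  (4, 4) camera-to-world transform T_t.
    Returns the concatenation of n_samples points drawn along the ray,
    Gaussian-distributed around the back-projected point.
    """
    origin = cam_pose[:3, 3]
    point_world = cam_pose[:3, :3] @ point_cam + origin
    direction = point_world - origin
    direction = direction / np.linalg.norm(direction)   # ray direction (x_d, y_d, z_d)
    offsets = np.random.normal(0.0, sigma, n_samples)   # Gaussian offsets along the ray
    samples = point_world[None, :] + offsets[:, None] * direction[None, :]
    return samples.reshape(-1)
```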
Step S4, a fully-connected network of 80 to 100 layers, i.e. a NeLF network, is constructed, and the scene map is stored in the form of an implicit neural light field, as shown in FIG. 2. The fully-connected network takes the multi-coordinate direction vector obtained in step S3 as input, outputs the RGB predicted values of the corresponding pixels, and assembles them back to the same size as the original image;

Step S5, taking the error between the RGB ground-truth values and the RGB predicted values assembled to the original image size in step S4 as the loss function, so as to train the fully-connected network online in a supervised manner, wherein the fully-connected network trained in real time comprises a map construction thread and a display thread;
In one embodiment, the step S5 specifically includes the following:
The loss function is the error between the RGB values predicted by the NeLF network along each ray and the RGB ground-truth values of the corresponding pixels, where (x_o, y_o, z_o) are the camera coordinates, (x_d, y_d, z_d) denote the 3-dimensional coordinate sequence corresponding to an image pixel, and R, G and B represent the red, green and blue color channels of the image.
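A compact PyTorch sketch of the NeLF network and one online training step of the map construction thread; the layer width, activation, optimizer and input dimension are assumptions, while the 80 to 100 layer depth and the per-pixel RGB supervision come from the text. In practice, an MLP this deep would likely need residual or skip connections to train stably; they are omitted here for brevity.

```python
import torch
import torch.nn as nn

class NeLF(nn.Module):
    """Deep MLP storing the scene as an implicit neural light field.

    in_dim is the length of the multi-coordinate direction vector from step S3;
    width and depth defaults are assumptions (the text specifies 80-100 layers).
    """
    def __init__(self, in_dim: int = 24, width: int = 128, depth: int = 88):
        super().__init__()
        layers = [nn.Linear(in_dim, width), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Linear(width, width), nn.ReLU(inplace=True)]
        layers += [nn.Linear(width, 3), nn.Sigmoid()]   # RGB prediction in [0, 1]
        self.net = nn.Sequential(*layers)

    def forward(self, ray_vectors: torch.Tensor) -> torch.Tensor:
        return self.net(ray_vectors)

def mapping_step(model, optimizer, ray_vectors, target_rgb):
    """One online training step: supervise the predicted colors with the RGB
    values of the current camera frame."""
    optimizer.zero_grad()
    pred_rgb = model(ray_vectors)                   # (N, 3) predictions per pixel
    loss = ((pred_rgb - target_rgb) ** 2).mean()    # photometric error
    loss.backward()
    optimizer.step()
    return loss.item()

nelf = NeLF()
opt = torch.optim.Adam(nelf.parameters(), lr=1e-4)
```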
Step S6, integrating the fast reasoning network trained in step S1 and the fully-connected network trained in real time in step S5 in a multithreaded manner to form a complete SLAM system, wherein the SLAM system comprises a positioning thread, a map construction thread and a display thread.

Specifically, the SLAM system processes the frame stream input by the camera to complete real-time localization and map construction, and at the same time displays the constructed scene map on an output display with interactive viewpoint control. The SLAM system specifically comprises the following modules:

(1) The positioning thread, which receives new frames in real time, extracts image features, predicts the camera pose and image depth of the current frame through the pre-trained network model, and maintains a frame graph with keyframes as nodes and inter-frame co-visibility relationships as edges;

(2) The map construction thread, which takes the camera pose and image depth computed by the positioning thread as input, uses the current RGB image from the camera as the label, and trains the NeLF network in real time so as to implicitly store the current scene map information;

(3) The display thread, which, according to the current NeLF parameters and the viewing-angle information from user interaction, infers in real time the picture presented by the scene map at that viewing angle; when the user's viewing angle at a given moment is r, the scene picture at that viewing angle is rendered as shown in FIG. 3.
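The skeleton below shows, under assumptions, how the three threads could be wired together with thread-safe queues; the class and function names are illustrative stand-ins, not the patent's API.

```python
import queue
import threading
import numpy as np

frame_queue = queue.Queue()   # RGB frames arriving from the camera
state_queue = queue.Queue()   # (frame, pose, inverse depth) handed to the mapper

class FastInferenceNet:
    """Stand-in for the pre-trained pose/inverse-depth network of step S1."""
    def infer(self, frame):
        return np.eye(4), np.ones(frame.shape[:2])    # dummy pose and inverse depth

class NeLFMap:
    """Stand-in for the online-trained NeLF network of steps S4-S5."""
    def train_step(self, frame, pose, inv_depth):
        pass                                          # one supervised update on this frame
    def render(self, view_pose, shape=(480, 640, 3)):
        return np.zeros(shape)                        # rendered view at the user viewpoint

def positioning_thread(tracker):
    while True:
        frame = frame_queue.get()
        pose, inv_depth = tracker.infer(frame)        # localization
        state_queue.put((frame, pose, inv_depth))

def mapping_thread(nelf_map):
    while True:
        frame, pose, inv_depth = state_queue.get()
        nelf_map.train_step(frame, pose, inv_depth)   # map construction

def display_thread(nelf_map, view_pose):
    while True:
        _image = nelf_map.render(view_pose)           # would be pushed to the display

tracker, nelf_map = FastInferenceNet(), NeLFMap()
for target, args in [(positioning_thread, (tracker,)),
                     (mapping_thread, (nelf_map,)),
                     (display_thread, (nelf_map, np.eye(4)))]:
    threading.Thread(target=target, args=args, daemon=True).start()
frame_queue.put(np.zeros((480, 640, 3)))              # feed one dummy camera frame
```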
By constructing and training an estimation model, the invention realizes real-time inference of the corresponding camera pose and image depth from RGB images, solving the cost problem caused by the need for lidar as sensing equipment in conventional SLAM;

The invention adopts a correlation volume pyramid structure to represent the correlation volume of pixels between two frames, attending to large-amplitude motion without losing information about small-amplitude motion, thereby improving the accuracy of camera pose prediction;

The ConvGRU module is combined with the DBA module to realize optical-flow estimation, and the camera pose and image depth are estimated from the estimated optical flow, which reduces the cost of expensive sensors such as lidar required by SLAM and widens the application range of SLAM technology;

The invention applies the implicit neural light field to SLAM, improving the ability of SLAM to express the details, textures and rays of the constructed scene map, and solving the problem of discontinuous viewpoint switching of the scene map.
According to a second aspect of the present specification, there is provided a real-time monocular SLAM device represented by an implicit neural light field, comprising a memory having executable code stored therein and one or more processors, which when executing the executable code, are adapted to implement a real-time monocular SLAM method represented by an implicit neural light field.
Corresponding to the embodiments of the real-time monocular SLAM method represented by the implicit neural light field described above, the present invention also provides embodiments of a real-time monocular SLAM device represented by the implicit neural light field.
Referring to fig. 4, a real-time monocular SLAM apparatus represented by an implicit neural light field according to an embodiment of the present invention includes a memory and one or more processors, wherein executable codes are stored in the memory, and when the processor executes the executable codes, the processor is configured to implement the real-time monocular SLAM method represented by the implicit neural light field according to the above embodiment.
The embodiment of the real-time monocular SLAM device represented by an implicit neural light field can be applied to any device with data processing capability, such as a computer. The device embodiment may be implemented by software, or by hardware or a combination of hardware and software. Taking a software implementation as an example, the device in the logical sense is formed by the processor of the device with data processing capability reading the corresponding computer program instructions from a non-volatile memory into memory. In terms of hardware, FIG. 4 shows a hardware structure diagram of the device with data processing capability where the real-time monocular SLAM device represented by an implicit neural light field is located; in addition to the processor, memory, network interface and non-volatile memory shown in FIG. 4, the device with data processing capability in the embodiment generally also includes other hardware according to its actual function, which is not described here again.
The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
For the device embodiments, since they essentially correspond to the method embodiments, reference is made to the description of the method embodiments for the relevant points. The device embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present invention. Those of ordinary skill in the art can understand and implement this without inventive effort.
The embodiment of the present invention also provides a computer readable storage medium having a program stored thereon, which when executed by a processor, implements the real-time monocular SLAM method represented by an implicit neural light field in the above embodiment.
The computer readable storage medium may be an internal storage unit, such as a hard disk or memory, of any of the data-processing-capable devices of the preceding embodiments. The computer readable storage medium may also be an external storage device of the device with data processing capability, for example, a plug-in hard disk, a Smart Media Card (SMC), an SD card, a Flash Card, or the like, provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of the device with data processing capability. The computer readable storage medium is used for storing the computer program and other programs and data required by the device with data processing capability, and may also be used for temporarily storing data that has been output or is to be output.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of the present description to describe various information, this information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "at the time of" or "when" or "in response to determining", depending on the context.
The foregoing description is only of preferred embodiments of one or more embodiments of the present description and is not intended to limit them; any modification made within the spirit and principles of one or more embodiments of the present description shall fall within their scope of protection.

Claims (7)

1. A real-time monocular SLAM method represented by an implicit neural light field, comprising the steps of:
step S1, pre-training a fast reasoning network for estimating the pose and the inverse depth map of a camera from RGB images in a supervised learning mode by utilizing a video stream data set with the pose and the depth information of the camera, wherein the trained fast reasoning network comprises a positioning thread;
step S2, feeding the RGB image in real time to the fast reasoning network trained in step S1, so as to acquire and maintain two state variables of the RGB image: the camera pose T_t and the inverse depth map D_t;

step S3, based on the back-projection model, mapping the camera pose T_t and the inverse depth map D_t of the RGB image obtained in step S2, in combination with the known camera intrinsic parameters, into three-dimensional spatial coordinates (X, Y, Z);

for an RGB image I_t, the position relative to the camera of the three-dimensional voxel corresponding to a pixel (u, v) is obtained through the back-projection model and the depth value d_{u,v}; the coordinates are calculated as:

$$X = \frac{(u - C_x)\,Z}{F_x}, \qquad Y = \frac{(v - C_y)\,Z}{F_y}, \qquad Z = d_{u,v}$$

wherein the depth value d_{u,v} is the reciprocal of the inverse depth of the corresponding pixel in the inverse depth map obtained in step S2;

(u, v) are the coordinates of a pixel in image I_t, namely the horizontal and vertical coordinates of the image pixel; (C_x, C_y) are the center coordinates of the image, C_x and C_y being the horizontal and vertical coordinates of the image center; (F_x, F_y) are the focal lengths of the camera intrinsics, F_x and F_y being the horizontal and vertical focal lengths of the camera lens; and (X, Y, Z) are the coordinates of the voxel relative to the camera position along the X, Y and Z axes, respectively;

the ray direction vector (x_d, y_d, z_d) of the voxel corresponding to the pixel (u, v) is calculated from the coordinates (X, Y, Z) and the current camera pose T_t;

multiple coordinate points are sampled along the ray-direction path according to a Gaussian distribution centered at the three-dimensional coordinates (X, Y, Z) corresponding to the image pixel, and these points are concatenated into a multi-coordinate direction vector carrying the ray direction information;

step S4, constructing a fully-connected network of 80 to 100 layers, wherein the fully-connected network takes the multi-coordinate direction vector obtained in step S3 as input, outputs the RGB predicted values of the corresponding pixels, and assembles them back to the same size as the original image;

step S5, taking the error between the RGB ground-truth values and the RGB predicted values assembled to the original image size in step S4 as the loss function, so as to train the fully-connected network online in a supervised manner, wherein the fully-connected network trained in real time comprises a map construction thread and a display thread;

step S6, integrating the fast reasoning network trained in step S1 and the fully-connected network trained in real time in step S5 in a multithreaded manner to form a complete SLAM system, wherein the SLAM system comprises the positioning thread, the map construction thread and the display thread.
2. The real-time monocular SLAM method represented by an implicit neural light field according to claim 1, wherein step S1 specifically comprises the following:
step S11, extracting image features and context features from the images of the video stream data set to obtain dense features;

step S12, calculating the correlation volumes between all pixel pairs of two adjacent frames based on the image features in the dense features to describe the visual similarity of the pixel pairs, thereby obtaining a 4-dimensional tensor C_ij storing the correlations of all pixel pairs between the two adjacent frames, i.e. a correlation feature map;

step S13, constructing a multi-layer correlation pyramid with different scales based on the correlation feature map obtained in step S12, and looking up the 1-dimensional feature vector x_t of the required pixel pairs from the multi-layer correlation pyramid;

step S14, constructing an update operator based on the context features obtained in step S11, the 1-dimensional feature vector x_t obtained in step S13, and the camera pose T_t and inverse depth map D_t corresponding to the previous frame, so as to dynamically update the adjustment of the camera pose and the correction of the depth map, wherein the inverse depth values of the inverse depth map are the reciprocals of the depth values of the depth map, and the camera pose T_t ∈ SE(3); SE(3) denotes the rigid-body transformations in three-dimensional Euclidean space, i.e. the 6 degrees of freedom of position and rotation; the inverse depth map D_t is a positive real 2-dimensional matrix of size H×W;
step S15, the pose loss and the depth loss are combined to supervise and train the fast reasoning network.
3. The real-time monocular SLAM method of claim 2, wherein step S11 specifically includes the following:
step S11a, two adjacent frames I_i and I_j of the video stream data set are fed in turn to the feature encoder g_θ, which extracts image features at 1/8 resolution, i.e. g_θ: R^{H×W×3} → R^{H/8×W/8×D}, to obtain the image features; the feature encoder g_θ consists of a convolutional neural network and several residual blocks, where D = 256 is taken, and H and W denote the height and width of the image in pixels;

step S11b, image I_i is fed to the context encoder h_θ, whose structure is the same as that of the feature encoder g_θ, to obtain the context features.
4. The real-time monocular SLAM method of claim 3, wherein step S12 specifically includes the following:
the correlation volume of the two frames, C_ij, a real 4-dimensional tensor of size H×W×H×W, is computed from the image features in the extracted dense features;

the correlation of a pair of pixels in the correlation volume C_ij is calculated as:

$$C_{ij}(u_1, v_1, u_2, v_2) = \left\langle\, g_\theta(I_i)(u_1, v_1),\; g_\theta(I_j)(u_2, v_2) \,\right\rangle$$

wherein C_ij(u_1, v_1, u_2, v_2) is the correlation between pixel (u_1, v_1) in I_i and pixel (u_2, v_2) in I_j, computed as the inner product of the corresponding feature vectors, and u and v are the horizontal and vertical pixel coordinates, respectively.
5. The real-time monocular SLAM method of claim 4, wherein step S13 specifically includes the following:
step S13a, for two adjacent frames of images I i And image I j Correlation volume C of (2) ij Processing C with average pooling operations of kernel sizes 2, 4 and 8 ij The resolution of the correlation volume is reduced to 1/2, 1/4 and 1/8 of the original resolution, so that a pyramid structure is formed;
step S13a, calculating an image I by using the estimated current camera pose and the inverse depth i The middle pixel coordinates Pi are mapped to the next frame image I j Pixel coordinate P of (2) ijAnd according to P ij Dynamically from a correlation volume C ij Find the corresponding eigenvector in, to achieve this goal, introduce a related volume find operator L rTaking a coordinate grid with the size of H multiplied by W as an input, searching a correlation volume from each layer of a correlation pyramid in a bilinear sampling mode by using a grid with the radius of r, and connecting search results of each layer into a final feature vector x t Characterizing the pixel points to realize the correlation characteristics among the pixel points of the multi-scale and hierarchical steps, and simultaneously, obtaining the correlation characteristic vector x in the way t Will also contain P ij Visual similarity information of the vicinity;
the pixel coordinates P_i in image I_i are mapped to the pixel coordinates P_ij in the next frame image I_j by the following formula:

P_ij = π_c(T_j · T_i^{-1} · π_c^{-1}(P_i, d_i))

wherein π_c is the camera model that projects three-dimensional coordinate points onto the camera image; π_c^{-1} is the inverse projection function, which maps the inverse depth d_i and the coordinates P_i to a three-dimensional point cloud; T_i is the camera pose corresponding to image I_i, and T_j is the predicted value of the camera pose corresponding to image I_j.
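For illustration, a PyTorch sketch of claim 5's three ingredients: the average-pooling correlation pyramid (S13a), the lookup operator L_r with bilinear sampling and radius r (S13b), and the reprojection P_ij = π_c(T_j · T_i^{-1} · π_c^{-1}(P_i, d_i)). A pinhole camera model with intrinsics K and world-to-camera pose matrices are assumptions made here for concreteness; the claim does not fix the camera model's form.

```python
import torch
import torch.nn.functional as F

def build_corr_pyramid(corr: torch.Tensor):
    """corr: (H, W, H, W) correlation volume C_ij. Average pooling with
    kernel sizes 2, 4 and 8 is applied to the last two dimensions,
    giving a pyramid at 1, 1/2, 1/4 and 1/8 resolution (step S13a)."""
    H, W = corr.shape[:2]
    c = corr.reshape(H * W, 1, H, W)
    return [(1, c)] + [(k, F.avg_pool2d(c, kernel_size=k, stride=k)) for k in (2, 4, 8)]

def lookup(pyramid, coords: torch.Tensor, radius: int = 3) -> torch.Tensor:
    """Lookup operator L_r (step S13b): sample a (2r+1)x(2r+1) neighbourhood
    around the correspondence P_ij at every pyramid level with bilinear
    interpolation and concatenate the results into x_t.

    coords: (H, W, 2) correspondences in (x, y) pixel units at 1/8 resolution."""
    H, W, _ = coords.shape
    r = radius
    offsets = torch.stack(torch.meshgrid(
        torch.arange(-r, r + 1, dtype=torch.float32),
        torch.arange(-r, r + 1, dtype=torch.float32), indexing="xy"), dim=-1)
    feats = []
    for scale, c in pyramid:                      # c: (H*W, 1, h, w)
        h, w = c.shape[-2:]
        grid = coords.reshape(H * W, 1, 1, 2) / scale + offsets
        gx = 2.0 * grid[..., 0] / (w - 1) - 1.0   # normalise to [-1, 1] for grid_sample
        gy = 2.0 * grid[..., 1] / (h - 1) - 1.0
        sampled = F.grid_sample(c, torch.stack((gx, gy), dim=-1), align_corners=True)
        feats.append(sampled.reshape(H, W, -1))
    return torch.cat(feats, dim=-1)               # x_t: (H, W, 4 * (2r+1)**2)

def reproject(coords, inv_depth, T_i, T_j, K):
    """P_ij = pi_c(T_j * T_i^{-1} * pi_c^{-1}(P_i, d_i)) for a pinhole model.

    coords: (H, W, 2) pixel coordinates P_i; inv_depth: (H, W) inverse depth d_i;
    T_i, T_j: (4, 4) world-to-camera poses (assumed convention); K: (3, 3) intrinsics."""
    H, W, _ = coords.shape
    z = 1.0 / inv_depth.clamp(min=1e-6)
    X = torch.stack(((coords[..., 0] - K[0, 2]) / K[0, 0] * z,
                     (coords[..., 1] - K[1, 2]) / K[1, 1] * z,
                     z, torch.ones_like(z)), dim=-1)              # pi_c^{-1}: (H, W, 4)
    X_j = (T_j @ torch.linalg.inv(T_i) @ X.reshape(-1, 4).t()).t().reshape(H, W, 4)
    zj = X_j[..., 2].clamp(min=1e-6)
    return torch.stack((K[0, 0] * X_j[..., 0] / zj + K[0, 2],      # pi_c: project
                        K[1, 1] * X_j[..., 1] / zj + K[1, 2]), dim=-1)
```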
6. The real-time monocular SLAM method of claim 5, wherein step S14 specifically includes the following:
the update operator used to update the camera pose T_t and the inverse depth map D_t mainly comprises two core modules, ConvGRU and DBA, specifically:
a) ConvGRU module
The ConvGRU module combines a convolutional neural network (CNN) with a gated recurrent unit (GRU), taking information in both the temporal and spatial dimensions into account to iteratively update the optical flow P_ij; the module has two inputs: first, the correlation feature vector x_t retrieved from the levels of the correlation-volume pyramid of C_ij and concatenated; second, the context feature vector h_0 obtained by applying an additional convolution to the context feature map extracted from image I_i; finally, the module passes its hidden state through an additional convolution layer to output a corrected optical-flow field ΔP_ij and an associated confidence map w_ij, which are used to compute the corrected pixel mapping P*_ij and its confidence map w*_ij;
the corrected pixel mapping P*_ij and its confidence map w*_ij are computed as follows:

P*_ij = P_ij + ΔP_ij

with the confidence map w*_ij obtained from the confidence output w_ij of the module.
b) Dense Bundle Adjustment (DBA) module
The DBA module simultaneously adjusts the intrinsic and extrinsic camera parameters and the spatial positions of the feature points in the three-dimensional scene, so as to minimise the error between the positions at which feature points observed from multiple viewpoints project onto the image and their actually detected positions, thereby jointly optimising the maintained camera poses T_i and the spatial positions of the feature points; in other words, the DBA module maps the corrected optical-flow field P*_ij to update quantities of the camera pose and the inverse depth;
the error between the projected positions of the feature points on the image and their actually detected positions is minimised; at the pixel level the objective is:

L(T*, d*) = Σ_(i,j) ‖ P*_ij − π_c(T*_j · (T*_i)^{-1} · π_c^{-1}(P_i, d*_i)) ‖²_Σij

wherein T* and d* are the camera poses and inverse depths after applying the update quantities ΔT_ij and Δd_i of the camera pose and the inverse depth, respectively; ‖·‖_Σ is the Mahalanobis distance, which weights the error term with the confidence weights w_ij, i.e. Σ_ij = diag(w_ij);
In the solution process using the Gauss-Newton algorithm, the sparse structure of the Hessian matrix H can be exploited with Schur elimination to accelerate the computation, so the incremental linear equation HΔx = b can be written in the following block equation-set form:

G·Δξ + E·Δd = v
E^T·Δξ + D·Δd = w

wherein H is the Hessian matrix, H = J^T J, and J is the Jacobian of the image error L(T*, d*); G is the camera block matrix containing the camera-pose-related information; D is the diagonal block matrix corresponding to the pixel points, containing the inverse-depth-related information of the pixel points, and its size is much larger than that of G; E is the off-diagonal block matrix, related to the specific observation data; Δξ is the update quantity of the camera-pose Lie algebra; Δd is the update quantity of the inverse depth of each pixel;
solving for Δξ in the linear equation set by Gaussian elimination yields the incremental equation for the camera-pose part as follows:
(H/D)Δξ = [G − ED^{-1}E^T]Δξ = v − ED^{-1}w
wherein H/D is the Schur complement of the Hessian matrix H with respect to the pixel block matrix D, and can be regarded as a reduced camera matrix;
substituting the solved Δξ back into the original incremental linear equation set, the inverse-depth increment Δd of the pixel points is calculated as follows:
Δξ = [G − ED^{-1}E^T]^{-1}(v − ED^{-1}w)
Δd = D^{-1}(w − E^T Δξ).
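A compact numerical sketch of the Schur-complement solve in claim 6: eliminating the large diagonal depth block D first gives the reduced camera system for Δξ, after which Δd follows by back-substitution, exactly as in the two formulas above. The block sizes, the diagonal representation of D, and the absence of damping or robust weighting are simplifying assumptions.

```python
import torch

def schur_solve(G, E, D_diag, v, w):
    """Solve the Gauss-Newton block system
        [G   E] [dxi]   [v]
        [E^T D] [dd ] = [w]
    via the Schur complement of D:
        (G - E D^{-1} E^T) dxi = v - E D^{-1} w     # reduced camera matrix H/D
        dd = D^{-1} (w - E^T dxi)

    G: (p, p) camera block; E: (p, n) off-diagonal block;
    D_diag: (n,) diagonal of the per-pixel inverse-depth block; v: (p,); w: (n,)."""
    D_inv = 1.0 / D_diag
    S = G - E @ (D_inv[:, None] * E.t())          # Schur complement G - E D^{-1} E^T
    dxi = torch.linalg.solve(S, v - E @ (D_inv * w))
    dd = D_inv * (w - E.t() @ dxi)
    return dxi, dd

# toy usage; the sizes (2 poses x 6 DoF, 500 pixels) are arbitrary placeholders
p, n = 12, 500
E = torch.randn(p, n) * 0.01
G = torch.eye(p) * 2.0
D_diag = torch.rand(n) + 1.0
v, w = torch.randn(p), torch.randn(n)
dxi, dd = schur_solve(G, E, D_diag, v, w)
```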
7. a real-time monocular SLAM device represented by an implicit neural light field, comprising a memory and one or more processors, the memory having executable code stored therein, wherein the processor, when executing the executable code, is configured to implement the real-time monocular SLAM method represented by an implicit neural light field as claimed in any of claims 1-6.
Priority Applications (1)

Application Number: CN202311307483.5A
Priority Date / Filing Date: 2023-10-08
Title: Real-time monocular SLAM method and device represented by implicit nerve light field
Legal Status: Pending

Publications (1)

Publication Number: CN117274373A
Publication Date: 2023-12-22

Family ID: 89206081

Country Status (1)

Country: CN
Link: CN117274373A (en)

Legal Events

Date Code Title Description
PB01 Publication