WO2023165093A1 - Training method for visual-inertial odometry model, pose estimation method and apparatuses, electronic device, computer-readable storage medium, and program product


Publication number: WO2023165093A1
Authority: WO (WIPO PCT)
Application number: PCT/CN2022/112430
Other languages: English (en), Chinese (zh)
Inventors: 潘友琦, 查红彬, 刘浩敏
Original Assignee: 上海商汤智能科技有限公司
Application filed by 上海商汤智能科技有限公司
Publication of WO2023165093A1

Classifications

    • G06T 7/73: Image analysis; determining position or orientation of objects or cameras using feature-based methods
    • G06N 20/00: Machine learning (computing arrangements based on specific computational models)
    • G06T 2207/30244: Indexing scheme for image analysis or image enhancement; subject of image: camera pose

Description

  • This disclosure is based on, and claims priority to, the Chinese patent application with application number 202210195781.9, filed on March 1, 2022 and entitled "Visual-inertial odometry model training method, pose estimation method and device"; the entire content of that Chinese patent application is incorporated into this disclosure by reference.
  • Embodiments of the present disclosure relate to the technical field of computer vision, and in particular to a training method of a visual-inertial odometry model, a pose estimation method, a device, electronic equipment, a computer-readable storage medium, and a program product.
  • Visual odometry is a sub-module of the visual SLAM (Simultaneous Localization and Mapping) problem. It uses two adjacent frames captured by the camera on a robot during its movement to compute the relative pose between the two frames. Because visual odometry uses only cameras as sensors, it is strongly affected by optical conditions such as illumination changes, moving objects, and texture-less regions.
  • Some existing methods use an inertial measurement unit (IMU, Inertial Measurement Unit) as a supplement to vision and design a visual-inertial odometry system: the inertial sensor measures the acceleration and angular velocity of the robot, and these measurements are fused with the visual information to obtain a more robust estimate.
  • Current visual-inertial odometry mostly adopts nonlinear optimization to fuse camera and inertial sensor information in a tightly coupled form.
  • This approach suffers from complex initialization and calibration, a time-consuming or even divergent optimization iteration process, and possible tracking loss (that is, the pose can no longer be estimated).
  • Existing visual-inertial odometry based on deep learning, on the one hand, cannot recover the motion scale the way traditional methods do; on the other hand, it treats the visual and inertial parts as independent modules, resulting in poor pose estimation accuracy.
  • Embodiments of the present disclosure provide a training method for a visual-inertial odometry model, a pose estimation method, devices, an electronic device, a computer-readable storage medium, and a program product, which improve the accuracy and robustness of the visual-inertial odometry model and thereby improve the accuracy of pose estimation.
  • the first aspect of an embodiment of the present disclosure provides a training method for a visual-inertial odometry model
  • The training method for the visual-inertial odometry model includes: acquiring a sample image set and a sample IMU data set, where the sample image set includes several frames of continuous sample color images acquired by an image acquisition device and the sample IMU data set includes the corresponding sample IMU data obtained while those continuous sample color images are acquired; inputting two adjacent frames of sample color images in the sample image set and the corresponding sample IMU data between the two adjacent frames of sample color images into the visual-inertial odometry model, and outputting the two depth images corresponding to the two adjacent frames of sample color images and the estimated pose at which the image acquisition device acquires the two adjacent frames of sample color images; determining the target loss function of the visual-inertial odometry model based on the two depth images corresponding to the two adjacent frames of sample color images, the estimated pose at which the image acquisition device acquires the two adjacent frames of sample color images, and the corresponding sample IMU data between the two adjacent frames of sample color images; and adjusting the network parameters of the visual-inertial odometry model by using the target loss function.
  • In this way, the sample image set includes several frames of continuous sample color images acquired by an image acquisition device, and the sample IMU data set includes the corresponding sample IMU data obtained while those continuous sample color images are acquired.
  • After the sample image set and the sample IMU data set are input into the visual-inertial odometry model, the model can be used to estimate the scene depth and the pose of the image acquisition device, outputting the two depth images corresponding to two adjacent frames of sample color images and the estimated pose at which the image acquisition device acquires the two adjacent frames of sample color images.
  • The target loss function of the visual-inertial odometry model is then determined based on these two depth images, the estimated pose, and the corresponding sample IMU data between the two adjacent frames of sample color images, so that visual information and IMU information are fused together inside the network.
  • In some embodiments, the visual-inertial odometry model includes a depth estimation network, a visual coding network, an IMU coding network and a visual-inertial fusion network. Inputting the two adjacent frames of sample color images in the sample image set and the corresponding sample IMU data between them into the visual-inertial odometry model, and outputting the two depth images corresponding to the two adjacent frames of sample color images and the estimated pose at which the image acquisition device acquires the two adjacent frames of sample color images, includes: inputting a sample color image in the sample image set into the depth estimation network to obtain the depth image corresponding to that sample color image; superimposing the previous frame sample color image and the current frame sample color image in the sample image set and inputting them into the visual coding network to obtain a visual feature code; inputting the corresponding sample IMU data between the previous frame sample color image and the current frame sample color image into the IMU coding network to obtain an IMU feature code; and inputting the visual feature code and the IMU feature code into the visual-inertial fusion network to obtain the estimated pose at which the image acquisition device acquires the current frame sample color image.
  • In this way, the visual-inertial odometry model is composed of a depth estimation network, a visual coding network, an IMU coding network and a visual-inertial fusion network. Inputting a sample color image in the sample image set into the depth estimation network yields the depth image corresponding to that sample color image, thereby realizing estimation of the depth map of the environment where the image acquisition device is located. The visual feature code is obtained by superimposing the previous frame sample color image and the current frame sample color image in the sample image set and inputting them into the visual coding network; at the same time, the corresponding sample IMU data between the previous frame sample color image and the current frame sample color image is input into the IMU coding network to obtain the IMU feature code. The visual feature code and the IMU feature code are then input into the visual-inertial fusion network to obtain the estimated pose at which the image acquisition device acquires the current frame sample color image, thereby realizing estimation of the pose of the image acquisition device itself.
  • In some embodiments, the depth estimation network includes an encoder and a decoder connected to each other. Inputting the sample color image in the sample image set into the depth estimation network to obtain the depth image corresponding to the sample color image includes: inputting the sample color image into the depth estimation network, transforming the sample color image into a depth feature map through the downsampling layers of the encoder, and then transforming the depth feature map into the depth image corresponding to the sample color image through the upsampling layers of the decoder.
  • In this way, the depth estimation network adopts an encoder-decoder structure: the downsampling layers of the encoder transform the sample color image into a depth feature map, and the upsampling layers of the decoder transform the depth feature map into the depth image corresponding to the sample color image, so that the depth map of the environment where the image acquisition device is located can be estimated within a deep learning framework.
  • In some embodiments, the visual-inertial fusion network adopts an attention mechanism and includes a feedforward neural network. Inputting the visual feature code and the IMU feature code into the visual-inertial fusion network to obtain the estimated pose at which the image acquisition device acquires the current frame sample color image includes: performing weighted fusion of the visual feature code and the IMU feature code through the attention mechanism to obtain an optimized feature code, and processing the optimized feature code with the feedforward neural network to obtain the estimated pose at which the image acquisition device acquires the current frame sample color image.
  • In this way, the visual feature code and the IMU feature code are weighted and fused through the attention mechanism to obtain the optimized feature code, the feedforward neural network processes the optimized feature code, and the estimated pose at which the image acquisition device acquires the current frame sample color image is obtained.
  • The attention mechanism exploits the complementarity of visual information and IMU information: IMU information provides better motion estimation for short-term fast motion, while visual information does not drift the way IMU information does. Therefore, in different scenes the attention mechanism can effectively learn the relationship between visual features and inertial features, making the visual-inertial odometry model perform more robustly across scenarios.
  • In some embodiments, the visual-inertial fusion network also includes a first multi-layer perceptron and a second multi-layer perceptron. Performing weighted fusion of the visual feature code and the IMU feature code through the attention mechanism to obtain the optimized feature code includes: inputting the IMU feature code into the first multi-layer perceptron and the second multi-layer perceptron respectively to obtain several key-value pairs, each of which includes a key and a value; obtaining the similarity between the visual feature code and the key in each key-value pair; and using each similarity as a weight, multiplying the weight by the value in the corresponding key-value pair, and summing the results to obtain the optimized feature code.
  • In some embodiments, the target loss function includes a depth loss function, a photometric loss function, and an IMU loss function. Determining the target loss function of the visual-inertial odometry model based on the two depth images corresponding to the two adjacent frames of sample color images, the estimated pose at which the image acquisition device acquires the two adjacent frames of sample color images, and the corresponding sample IMU data between the two adjacent frames of sample color images includes: determining the depth loss function according to the depth image corresponding to the previous frame sample color image and the depth image corresponding to the current frame sample color image; determining the photometric loss function according to the estimated pose at which the image acquisition device acquires the current frame sample color image and the depth image corresponding to the current frame sample color image; and determining the IMU loss function according to the estimated pose at which the image acquisition device acquires the current frame sample color image and the corresponding sample IMU data between the previous frame sample color image and the current frame sample color image.
  • In this way, the target loss function of the visual-inertial odometry model includes the depth loss function, the photometric loss function and the IMU loss function. The photometric difference and the depth-map difference between the two frames after warping with the estimated pose and depth give the visual photometric error and geometric error, so the depth loss function and photometric loss function constrain both the depth estimation and the pose estimation. At the same time, the IMU error is computed from the difference between the result given by the kinematics of the IMU itself and the pose estimation result, using two constraints of the IMU, namely the velocity constraint and the position constraint. Associating the pose predicted by the network with the physical properties of the IMU makes the training of the visual-inertial odometry model converge faster and recover absolute scale.
  • the second aspect of the embodiments of the present disclosure provides a pose estimation method
  • The pose estimation method includes: using an image acquisition device to acquire several frames of continuous target color images, and determining the corresponding target IMU data obtained while the image acquisition device acquires the several frames of continuous target color images; and inputting the several frames of continuous target color images and the corresponding target IMU data into a visual-inertial odometry model to obtain the estimated pose at which the image acquisition device acquires the target color images, where the visual-inertial odometry model is obtained by the training method of the visual-inertial odometry model in the first aspect above.
  • In this way, the image acquisition device acquires several frames of continuous target color images, the corresponding target IMU data obtained during that acquisition is determined, and the several frames of continuous target color images and the corresponding target IMU data are input into the visual-inertial odometry model to obtain the estimated pose at which the image acquisition device acquires the target color images. Since the visual-inertial odometry model is trained by the training method of the visual-inertial odometry model in the first aspect above, it fuses visual information and IMU information inside the network and exploits the respective advantages of the two, giving more accurate and more robust pose estimation results.
  • the third aspect of the embodiment of the present disclosure provides a training device for a visual-inertial odometer model
  • The training device for the visual-inertial odometry model includes: a sample acquisition module, used to acquire a sample image set and a sample IMU data set, where the sample image set includes several frames of continuous sample color images acquired by an image acquisition device and the sample IMU data set includes the corresponding sample IMU data obtained while those continuous sample color images are acquired; a processing module, used to input two adjacent frames of sample color images in the sample image set and the corresponding sample IMU data between the two adjacent frames of sample color images into the visual-inertial odometry model, and to output the two depth images corresponding to the two adjacent frames of sample color images and the estimated pose at which the image acquisition device acquires the two adjacent frames of sample color images; and a loss function determination module, used to determine the target loss function of the visual-inertial odometry model based on the two depth images corresponding to the two adjacent frames of sample color images, the estimated pose at which the image acquisition device acquires the two adjacent frames of sample color images, and the corresponding sample IMU data between the two adjacent frames of sample color images.
  • the fourth aspect of the embodiments of the present disclosure provides a pose estimation device.
  • The pose estimation device includes: a data acquisition module, configured to use an image acquisition device to acquire several frames of continuous target color images and to determine the corresponding target IMU data obtained while the image acquisition device acquires the several frames of continuous target color images; and a pose estimation module, used to input the several frames of continuous target color images and the corresponding target IMU data into the visual-inertial odometry model to obtain the estimated pose at which the image acquisition device acquires the target color images, where the visual-inertial odometry model is obtained by the training method of the visual-inertial odometry model in the first aspect above.
  • The fifth aspect of the embodiments of the present disclosure provides an electronic device, including a memory and a processor coupled to each other, where the processor is used to execute the program instructions stored in the memory, so as to implement the training method of the visual-inertial odometry model in the first aspect above or the pose estimation method in the second aspect above.
  • The sixth aspect of the embodiments of the present disclosure provides a computer-readable storage medium on which program instructions are stored; when the program instructions are executed by a processor, the training method of the visual-inertial odometry model in the first aspect above or the pose estimation method in the second aspect above is implemented.
  • In this way, the sample image set includes several frames of continuous sample color images acquired by an image acquisition device, and the sample IMU data set includes the corresponding sample IMU data obtained while those continuous sample color images are acquired.
  • After the sample image set and the sample IMU data set are input into the visual-inertial odometry model, the model can be used to estimate the scene depth and the pose of the image acquisition device, outputting the two depth images corresponding to two adjacent frames of sample color images and the estimated pose at which the image acquisition device acquires the two adjacent frames of sample color images.
  • The target loss function of the visual-inertial odometry model is then determined based on these two depth images, the estimated pose, and the corresponding sample IMU data between the two adjacent frames of sample color images; visual information and IMU information are thus combined inside the network, and by exploiting the respective advantages of the two, a more accurate and more robust visual-inertial odometry model can be obtained. In addition, implementing the visual-inertial odometry with a deep learning framework, compared with traditional nonlinear optimization based on BA (Bundle Adjustment), requires no complicated initialization and iteration process and yields a more concise model.
  • The seventh aspect of the embodiments of the present disclosure also provides a computer program product, including computer-readable code; the computer program product includes a computer program or instructions, and when the computer program or instructions are run on an electronic device, the electronic device is caused to execute the steps of the above training method of the visual-inertial odometry model or the above pose estimation method.
  • Fig. 1 is a schematic flow chart of an embodiment of the training method of the visual-inertial odometry model of the present disclosure;
  • Fig. 2 is a schematic flow chart of an embodiment of step S12 in Fig. 1;
  • FIG. 3 is a schematic diagram of the principle of obtaining a depth image through a depth estimation network in an application scenario
  • Fig. 4 is a schematic flow chart of an embodiment of step S124 in Fig. 2;
  • Fig. 5 is a schematic diagram of the principle of obtaining an estimated pose through a visual-inertial fusion network in an application scenario
  • FIG. 6 is a schematic flow chart of an embodiment of step S1241 in FIG. 4;
  • FIG. 7 is a schematic diagram of the principle of obtaining key-value pairs through IMU feature encoding in an application scenario
  • Fig. 8 is a schematic flow chart of an embodiment of step S13 in Fig. 1;
  • Fig. 9 is a schematic diagram of the principle of the training process of the visual-inertial odometry model in an application scenario
  • FIG. 10 is a schematic flow diagram of an embodiment of the pose estimation method of the present disclosure.
  • Fig. 11 is a schematic diagram of the principle of pose estimation realized by the visual-inertial odometry model in an application scenario
  • Fig. 12 is a schematic frame diagram of an embodiment of a training device for a visual-inertial odometry model of the present disclosure
  • Fig. 13 is a schematic frame diagram of an embodiment of a pose estimation device of the present disclosure.
  • Fig. 14 is a schematic frame diagram of an embodiment of an electronic device of the present disclosure.
  • FIG. 15 is a block diagram of an embodiment of a computer-readable storage medium of the present disclosure.
  • The terms "system" and "network" are often used interchangeably herein.
  • The term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone.
  • The character "/" herein generally indicates that the associated objects are in an "or" relationship.
  • "Many" herein means two or more than two.
  • FIG. 1 is a schematic flowchart of an embodiment of a training method for a visual-inertial odometry model of the present disclosure.
  • the training method of the visual-inertial odometry model may include the following steps:
  • Step S11 Obtain a sample image set and a sample IMU data set.
  • the sample image set includes several frames of continuous sample color images acquired by an image acquisition device
  • the sample IMU data set includes corresponding sample IMU data acquired when the several frames of continuous sample color images are acquired.
  • In some embodiments, the training device of the visual-inertial odometry model can use the image acquisition device to obtain the sample image set and the sample IMU data set. For example, during the movement of the image acquisition device, several frames of continuous sample color images can be obtained as the sample image set; an inertial navigation device is provided on the image acquisition device or on the mobile equipment where the image acquisition device is located, and while the image acquisition device acquires the several frames of continuous sample color images, the inertial navigation device can simultaneously acquire the corresponding sample IMU data as the sample IMU data set.
  • Step S12 Input the two adjacent frames of sample color images in the sample image set and the corresponding sample IMU data between the two adjacent frames of sample color images into the visual-inertial odometry model, and output the two depth images corresponding to the two adjacent frames of sample color images and the estimated pose at which the image acquisition device acquires the two adjacent frames of sample color images.
  • In some embodiments, the training device of the visual-inertial odometry model can select two adjacent frames of sample color images from the several frames of continuous sample color images in the sample image set, for example the current frame sample color image and the previous frame sample color image, and then find the sample IMU data between the current frame sample color image and the previous frame sample color image in the sample IMU data set. Since the sampling frequency of the IMU is generally higher than that of the images, there may be multiple groups of sample IMU data corresponding to two adjacent frames of sample color images.
  • The selected two adjacent frames of sample color images and all the sample IMU data corresponding to them form the current batch of training data, which is input into the visual-inertial odometry model; the model can then output the two depth images corresponding to the two adjacent frames of sample color images and the estimated pose at which the image acquisition device acquires the two adjacent frames of sample color images.
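  • As a purely illustrative sketch (not part of the disclosure), the pairing of two adjacent sample color images with the sample IMU data recorded between their timestamps could look as follows; all names, and the assumption that per-measurement IMU timestamps are available, are hypothetical.

```python
# Hypothetical sketch: pair two adjacent sample color images with the sample IMU
# data recorded between their timestamps (all names here are illustrative).
def make_training_sample(images, image_times, imu_records, imu_times, t):
    """images[t-1], images[t]: adjacent sample color images (H x W x 3 arrays);
    imu_records[k]: one (ax, ay, az, wx, wy, wz) measurement taken at imu_times[k]."""
    prev_img, curr_img = images[t - 1], images[t]
    # The IMU rate is usually higher than the camera rate, so several measurements
    # fall between the two image timestamps and all of them go into the batch.
    imu_between = [m for m, ts in zip(imu_records, imu_times)
                   if image_times[t - 1] <= ts < image_times[t]]
    return prev_img, curr_img, imu_between
```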
  • Step S13 Based on the two depth images corresponding to the two adjacent frames of sample color images, the estimated pose at which the image acquisition device acquires the two adjacent frames of sample color images, and the corresponding sample IMU data between the two adjacent frames of sample color images, determine the target loss function of the visual-inertial odometry model.
  • Step S14 Using the target loss function, adjust the network parameters of the visual-inertial odometry model.
  • In some embodiments, the training device of the visual-inertial odometry model can calculate the photometric difference and the depth-map difference between the two frames after warping according to the estimated pose and depth, obtaining the visual photometric error and the geometric error; it can also calculate the IMU error according to the difference between the result given by the kinematics of the IMU itself and the pose estimation result, so as to determine the target loss function of the visual-inertial odometry model.
  • the training device of the visual-inertial odometry model can adjust the network parameters of the visual-inertial odometry model according to the target loss function, so as to update the visual-inertial odometry model.
  • Whether the target loss function has converged can then be determined. If the target loss function converges, the update of the network parameters of the visual-inertial odometry model can be stopped; if the target loss function does not converge, the number of parameter adjustments can be checked, and when the number of adjustments reaches a preset number, the final visual-inertial odometry model is determined from the network parameters at that time, reducing the impact of loss-function non-convergence on training efficiency.
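  • The disclosure does not give a concrete training loop; the PyTorch-style sketch below only illustrates steps S11 to S14 together with the convergence/maximum-adjustment stopping rule described above. The names vio_model and target_loss, the optimizer choice and all hyperparameters are assumptions.

```python
import torch

def train(vio_model, loader, target_loss, max_steps=100000, tol=1e-4):
    """Illustrative loop: forward pass (S12), target loss (S13), parameter update (S14)."""
    optimizer = torch.optim.Adam(vio_model.parameters(), lr=1e-4)
    prev_loss = float("inf")
    for step, (img_prev, img_curr, imu) in enumerate(loader):
        depth_prev, depth_curr, pose = vio_model(img_prev, img_curr, imu)          # step S12
        loss = target_loss(depth_prev, depth_curr, pose, img_prev, img_curr, imu)  # step S13
        optimizer.zero_grad()
        loss.backward()                                                            # step S14
        optimizer.step()
        if abs(prev_loss - loss.item()) < tol:   # stop once the loss has converged
            break
        if step + 1 >= max_steps:                # or once the preset number of updates is reached
            break
        prev_loss = loss.item()
    return vio_model
```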
  • In this way, the sample image set includes several frames of continuous sample color images acquired by an image acquisition device, and the sample IMU data set includes the corresponding sample IMU data obtained while those continuous sample color images are acquired.
  • After the sample image set and the sample IMU data set are input into the visual-inertial odometry model, the model can be used to estimate the scene depth and the pose of the image acquisition device, yielding the two depth images corresponding to two adjacent frames of sample color images and the estimated pose at which the image acquisition device acquires the two adjacent frames of sample color images. The training device of the visual-inertial odometry model can therefore determine the target loss function of the model based on these two depth images, the estimated pose, and all the sample IMU data corresponding to the interval between the two adjacent frames of sample color images, and train the visual-inertial odometry model according to the target loss function.
  • That is, the training device of the visual-inertial odometry model fuses visual information and IMU information inside the network and exploits the respective advantages of the two to obtain a more accurate and more robust visual-inertial odometry model. In addition, implementing the visual-inertial odometry with a deep learning framework, compared with the traditional nonlinear optimization algorithm based on BA (Bundle Adjustment), requires no complicated initialization and iteration process, yields a more concise model, simplifies the initialization and optimization process, and reduces tracking loss in complex scenes.
  • FIG. 2 is a schematic flowchart of an embodiment of step S12 in FIG. 1 .
  • the visual-inertial odometry model includes a depth estimation network, a visual coding network, an IMU coding network and a visual-inertial fusion network; the realization of the above step S12 may include the following steps:
  • Step S121 Input the sample color images in the sample image set into the depth estimation network to obtain the depth images corresponding to the sample color images.
  • the training device of the visual-inertial odometry model can input the sample color image of the previous frame into the depth estimation network to obtain the depth image corresponding to the sample color image of the previous frame; input the sample color image of the current frame into the depth estimation network , the depth image corresponding to the sample color image of the current frame can be obtained.
  • In some embodiments, the input of the depth estimation network is the current frame sample color image (RGB image) of size H*W*3, and its output is the network-predicted depth image corresponding to the current frame sample color image, of size H*W*1.
  • In some embodiments, the depth estimation network includes an encoder and a decoder connected to each other. The above step S121 may include: inputting the sample color image into the depth estimation network, transforming the sample color image into a depth feature map with the downsampling layers of the encoder, and then transforming the depth feature map into the depth image corresponding to the sample color image with the upsampling layers of the decoder.
  • Figure 3 is a schematic diagram of the principle of obtaining depth images through the depth estimation network in an application scenario.
  • The depth estimation network adopts an encoder-decoder structure, and the sample color image is input into the depth estimation network. In the encoder (Encoder), the color image is transformed into a feature map of size H/64*W/64*1024 through the downsampling layers; in the decoder (Decoder), the upsampling layers transform this feature map into a depth image with the same spatial size as the sample color image. The deep learning framework can therefore be used to estimate a dense depth map of the environment where the image acquisition device is located.
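  • As an illustration only, a minimal encoder-decoder of this kind could look like the sketch below, assuming six stride-2 stages so that an H*W*3 input shrinks to an H/64*W/64*1024 feature map before being upsampled back to an H*W*1 depth image; the exact layer types and channel widths are assumptions, not the network of the disclosure.

```python
import torch
import torch.nn as nn

class DepthNet(nn.Module):
    """Illustrative encoder-decoder depth estimation network (H, W assumed divisible by 64)."""
    def __init__(self):
        super().__init__()
        chans = [3, 32, 64, 128, 256, 512, 1024]
        # Encoder: six downsampling layers, H*W*3 -> H/64*W/64*1024.
        self.encoder = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1),
                          nn.ReLU(inplace=True))
            for i in range(6)])
        # Decoder: six upsampling layers back to the input resolution.
        self.decoder = nn.Sequential(*[
            nn.Sequential(nn.ConvTranspose2d(chans[6 - i], chans[5 - i], 4, stride=2, padding=1),
                          nn.ReLU(inplace=True))
            for i in range(6)])
        self.head = nn.Conv2d(3, 1, 3, padding=1)   # final H*W*1 depth image

    def forward(self, rgb):                         # rgb: B x 3 x H x W
        return self.head(self.decoder(self.encoder(rgb)))
```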
  • Step S122 superimpose the sample color image of the previous frame and the sample color image of the current frame in the sample image set, and then input it into the visual coding network to obtain a visual feature code.
  • Step S123 Input the sample IMU data corresponding between the sample color image of the previous frame and the sample color image of the current frame into the IMU coding network to obtain the IMU feature code.
  • Step S124 Input the visual feature code and the IMU feature code into the visual-inertial fusion network to obtain an estimated pose when the image acquisition device acquires the sample color image of the current frame.
  • The visual coding network uses the image information of the sample color images to obtain a code containing pixel-motion and camera-motion information, while the IMU coding network uses the IMU data between the previous frame sample color image and the current frame sample color image to obtain a code with the same number of channels as the output of the visual coding network.
  • In some embodiments, the input of the visual coding network is the superposition of the current frame sample color image and the previous frame sample color image, of size H*W*6, and its output is the visual feature code of size 1*1024. The input of the IMU coding network is all the IMU data between the current frame sample color image and the previous frame sample color image; the IMU data can include acceleration data and angular velocity data. When the IMU frequency is 10 times that of the image acquisition device, the input size of the IMU coding network is 10*6 and the output size is 10*1024.
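  • For concreteness, the two encoders could be sketched as below, matching the sizes given above (a 6-channel stacked image mapped to a 1*1024 visual feature code, and 10 IMU measurements of 6 values each mapped to a 10*1024 IMU feature code); the specific layers are assumptions.

```python
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """Illustrative visual coding network: two stacked RGB frames -> 1*1024 visual code."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(6, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, 256, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(256, 1024, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1))                     # -> B x 1024 x 1 x 1

    def forward(self, stacked_rgb):                      # stacked_rgb: B x 6 x H x W
        return self.conv(stacked_rgb).flatten(1).unsqueeze(1)   # B x 1 x 1024

class IMUEncoder(nn.Module):
    """Illustrative IMU coding network: 10 measurements of 6 values -> 10*1024 IMU code."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(6, 256), nn.ReLU(inplace=True),
                                 nn.Linear(256, 1024))

    def forward(self, imu):                              # imu: B x 10 x 6
        return self.mlp(imu)                             # B x 10 x 1024
```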
  • the visual-inertial fusion network adopts a tight coupling method similar to traditional optimization, using the visual feature code output by the visual coding network and the IMU feature code output by the IMU coding network to obtain the fused code and finally output the estimated pose.
  • the above scheme uses a depth estimation network, a visual encoding network, an IMU encoding network, and a visual-inertial fusion network to form a visual-inertial odometry model, and obtains the depth image corresponding to the sample color image by inputting the sample color image in the sample image set into the depth estimation network.
  • In this way, estimation of the depth map of the environment where the image acquisition device is located is realized.
  • The visual feature code is obtained by superimposing the previous frame sample color image and the current frame sample color image in the sample image set and inputting them into the visual coding network; all the sample IMU data corresponding to the interval between the previous frame sample color image and the current frame sample color image is input into the IMU coding network to obtain the IMU feature code. The visual feature code and the IMU feature code are then input into the visual-inertial fusion network to obtain the estimated pose at which the image acquisition device acquires the current frame sample color image, thereby realizing estimation of the pose of the image acquisition device itself.
  • FIG. 4 is a schematic flowchart of an embodiment of step S124 in FIG. 2 .
  • the visual-inertial fusion network adopts an attention mechanism, and the visual-inertial fusion network includes a feedforward neural network; the implementation of the above step S124 may include the following steps:
  • Step S1241 Perform weighted fusion of the visual feature code and the IMU feature code through an attention mechanism to obtain an optimized feature code.
  • Step S1242 Using a feed-forward neural network to process the optimized feature code to obtain an estimated pose when the image acquisition device acquires the sample color image of the current frame.
  • Figure 5 is a schematic diagram of the principle of obtaining estimated poses through the visual-inertial fusion network in an application scenario.
  • the input accepted by the visual-inertial fusion network is the output of the visual encoding network and the IMU encoding network, and the sizes are 1*1024 and 10*1024, the final output of the visual-inertial fusion network is a pose estimate with a size of 1*6, where 6 means predicting the relative pose of 6 degrees of freedom, including 3-dimensional translation vectors and 3-dimensional Euler angles.
  • the visual-inertial fusion network uses an attention mechanism to fuse visual features and inertial features. As an important part of the neural network structure, the attention mechanism can suppress useless features in the channel and enhance the features that need to be used.
  • A pair (Key, Value) is called a key-value pair, where Key is the key and Value is the value. The IMU feature code is split into 10 codes of size 1*1024; the similarity between the visual feature code (Visual Code) and each key (Key) is calculated, each similarity is used as a weight multiplied by the corresponding value (Value), and the results are summed to obtain the optimized code (Refined Code) of size 1*1024, that is, the optimized feature code. The optimized feature code then passes through the feedforward neural network (Feed Forward) to finally obtain the 1*6 pose.
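  • A minimal sketch of such a fusion head is given below, assuming the keys and values are produced from the IMU feature code by two linear layers (standing in for the two multi-layer perceptrons) and that the similarity is a scaled dot product followed by softmax; these specific choices are assumptions, not the exact network of the disclosure.

```python
import torch
import torch.nn as nn

class SensorFusion(nn.Module):
    """Illustrative visual-inertial fusion: keys/values come from the IMU feature code,
    the visual feature code acts as the query, and a feed-forward head regresses the
    1*6 relative pose (3 translations + 3 Euler angles)."""
    def __init__(self, dim=1024):
        super().__init__()
        self.to_key = nn.Linear(dim, dim)     # stands in for the first multi-layer perceptron
        self.to_value = nn.Linear(dim, dim)   # stands in for the second multi-layer perceptron
        self.scale = dim ** -0.5
        self.ffn = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(inplace=True),
                                 nn.Linear(256, 6))

    def forward(self, visual_code, imu_code):
        # visual_code: B x 1 x 1024 (query), imu_code: B x 10 x 1024
        keys = self.to_key(imu_code)                              # B x 10 x 1024
        values = self.to_value(imu_code)                          # B x 10 x 1024
        sim = (visual_code @ keys.transpose(1, 2)) * self.scale   # B x 1 x 10 similarities
        weights = sim.softmax(dim=-1)                             # similarity used as weight
        refined = weights @ values                                # B x 1 x 1024 refined code
        return self.ffn(refined.squeeze(1))                       # B x 6 estimated pose
```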
  • FIG. 6 is a schematic flowchart of an embodiment of step S1241 in FIG. 4.
  • the visual-inertial fusion network further includes a first multi-layer perceptron and a second multi-layer perceptron;
  • the realization of the above step S1241 may include:
  • Step S12411 Input the IMU feature codes into the first multi-layer perceptron and the second multi-layer perceptron respectively to obtain several key-value pairs, each of which includes a key and a value.
  • Step S12412 Obtain the similarity between the visual feature code and the key in each key-value pair, use the similarity as a weight, multiply the weight by the value in the corresponding key-value pair, and then sum to obtain the Optimized feature encoding described above.
  • Figure 7 is a schematic diagram of the principle of obtaining key-value pairs from the IMU feature code in an application scenario. The key-value pairs are formed by passing the IMU feature code through different multilayer perceptrons (Multilayer Perceptron, MLP) or fully connected layers, and each key-value pair includes a key and a value. The similarity between the visual feature code and the key in each key-value pair is obtained; each similarity is used as a weight, multiplied by the value in the corresponding key-value pair, and the results are summed to obtain the optimized feature code, from which the estimated pose at which the image acquisition device acquires the current frame sample color image can be obtained. Therefore, by using a tightly coupled visual-inertial fusion network to fuse the visual feature code and the IMU feature code, compared with existing loose coupling that directly concatenates the two codes, the network can make full use of the accuracy of visual information and the high frequency of IMU information, so that visual information and IMU information complement each other better.
  • In some embodiments, the optimized feature code can be written as RefinedCode = Σ_i sim(Q, K_i) · V_i, where Q represents the visual feature code, the similarity sim(·,·) is normalized by the parameter d, K_i is the i-th key (Key), and V_i is the i-th value (Value).
  • the fusion method in the embodiment of the present disclosure makes full use of the accuracy of vision and the high frequency of IMU, and is a tightly coupled fusion method.
  • In this way, the visual feature code and the IMU feature code are weighted and fused through the attention mechanism to obtain the optimized feature code, the optimized feature code is processed by the feedforward neural network, and the estimated pose at which the image acquisition device acquires the current frame sample color image is obtained.
  • The attention mechanism exploits the complementarity of visual information and IMU information: IMU information provides better motion estimation for short-term fast motion, while visual information does not drift the way IMU information does. In different scenes the attention mechanism can therefore effectively learn the relationship between visual features and inertial features, making the visual-inertial odometry model perform more robustly.
  • the target loss function includes a depth loss function, a photometric loss function, and an IMU loss function; the implementation of the above step S13 may include the following steps:
  • Step S131 Determine the depth loss function according to the depth image corresponding to the sample color image of the previous frame and the depth image corresponding to the sample color image of the current frame.
  • Step S132 Determine the photometric loss function according to the estimated pose when the image acquisition device acquires the sample color image of the current frame and the depth image corresponding to the sample color image of the current frame.
  • Step S133 According to the estimated pose at which the image acquisition device acquires the current frame sample color image and the corresponding sample IMU data between the previous frame sample color image and the current frame sample color image, determine the IMU loss function.
  • FIG 9 is a schematic diagram of the training process of the visual-inertial odometry model in an application scenario.
  • The visual-inertial odometry model includes a depth estimation network (DepthNet), a visual coding network (VisualOdom), an IMU coding network (IMUOdom) and a visual-inertial fusion network (Sensor Fusion).
  • The depth loss function (Loss geo) is determined from the depth image corresponding to the previous frame sample color image (I t-1) and the depth image corresponding to the current frame sample color image (I t). The depth image of I t is obtained through the depth estimation network in the current training iteration, while the depth image of I t-1 was already obtained in the previous training iteration (Previous Output) and can be used directly in the current iteration.
  • The photometric loss function (Loss pho) is determined from the estimated pose (R, t) at which the image acquisition device acquires the current frame sample color image and the depth image corresponding to the current frame sample color image; the IMU loss function (Loss imu) is determined from that estimated pose (R, t) and all the corresponding sample IMU data between the previous frame sample color image and the current frame sample color image (IMU t-1,t).
  • The IMU loss function may include a velocity constraint and a position constraint.
  • In this way, the target loss function of the visual-inertial odometry model includes the depth loss function, the photometric loss function and the IMU loss function. The photometric difference and the depth-map difference between the two frames after warping with the estimated pose and depth are used to compute the visual photometric error and the geometric error; depth estimation and pose estimation are constrained by the depth loss function and the photometric loss function, that is, the target loss function includes the visual reprojection error. The IMU error is computed using two constraints of the IMU itself, namely the velocity constraint and the position constraint, associating the pose predicted by the network with the physical properties of the IMU; this makes the training of the visual-inertial odometry model converge faster and recover absolute scale.
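  • The disclosure does not spell out the exact formulas; as a hedged illustration, one common way to write the three terms is sketched below, where p is a pixel, D_t the predicted depth, w(·) the warping induced by the estimated pose (R, t), and Δp_IMU, Δv_IMU the position and velocity increments given by the IMU kinematics (the weights λ and the specific residual forms are assumptions).

```latex
\begin{aligned}
\mathcal{L}_{pho} &= \sum_{p}\bigl|\,I_{t}(p) - I_{t-1}\bigl(w(p;\,D_t,\,R,\,t)\bigr)\bigr|
  &&\text{photometric error after warping}\\
\mathcal{L}_{geo} &= \sum_{p}\bigl|\,D_{t-1}\bigl(w(p;\,D_t,\,R,\,t)\bigr) - \hat D_{t\to t-1}(p)\bigr|
  &&\text{depth-map (geometric) difference}\\
\mathcal{L}_{imu} &= \bigl\|\Delta p_{\mathrm{pred}} - \Delta p_{\mathrm{IMU}}\bigr\|^{2}
                   + \bigl\|\Delta v_{\mathrm{pred}} - \Delta v_{\mathrm{IMU}}\bigr\|^{2}
  &&\text{position and velocity constraints}\\
\mathcal{L} &= \lambda_{pho}\,\mathcal{L}_{pho} + \lambda_{geo}\,\mathcal{L}_{geo} + \lambda_{imu}\,\mathcal{L}_{imu}
\end{aligned}
```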
  • FIG. 10 is a schematic flow chart of an embodiment of the pose estimation method of the present disclosure.
  • the method may include the following steps:
  • Step S101 Use an image acquisition device to acquire several frames of continuous target color images, and determine the corresponding target IMU data when the image acquisition device acquires the several frames of continuous target color images.
  • Step S102 Input the several frames of continuous target color images and the corresponding target IMU data into the visual-inertial odometry model to obtain the estimated pose when the image acquisition device acquires the target color images.
  • the visual-inertial odometry model is obtained by using any one of the above-mentioned visual-inertial odometry model training methods.
  • Figure 11 is a schematic diagram of the principle of pose estimation through the visual-inertial odometry model in an application scenario. The collected target color images and the corresponding target IMU data are input into the visual-inertial odometry model, which includes the depth estimation network, the visual coding network, the IMU coding network and the visual-inertial fusion network, and which can output a dense depth map D and the camera pose (R, t).
  • In this way, the image acquisition device acquires several frames of continuous target color images, the corresponding target IMU data obtained during that acquisition is determined, and the several frames of continuous target color images and the corresponding target IMU data are input into the visual-inertial odometry model to obtain the estimated pose at which the image acquisition device acquires the target color images. Since the visual-inertial odometry model is trained by the training method of the visual-inertial odometry model in the first aspect above, it fuses visual information and IMU information inside the network and exploits the respective advantages of the two, giving more accurate and more robust pose estimation results.
  • The visual-inertial odometry model can be used to obtain the estimated pose of the image acquisition device for each pair of consecutive target color images, and the trajectory of the image acquisition device can be obtained by chaining these consecutive estimated poses.
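  • As an illustration of how consecutive 1*6 relative poses (3 translations plus 3 Euler angles) might be chained into such a trajectory, the following sketch is given; the Euler-angle convention and all function names are assumptions.

```python
import numpy as np

def euler_to_matrix(rx, ry, rz):
    """Convert Euler angles (radians) to a rotation matrix (Z*Y*X convention assumed)."""
    cx, sx, cy, sy, cz, sz = np.cos(rx), np.sin(rx), np.cos(ry), np.sin(ry), np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def chain_poses(relative_poses):
    """Chain per-frame relative poses (tx, ty, tz, rx, ry, rz) into global camera poses."""
    T = np.eye(4)
    trajectory = [T.copy()]
    for tx, ty, tz, rx, ry, rz in relative_poses:
        T_rel = np.eye(4)
        T_rel[:3, :3] = euler_to_matrix(rx, ry, rz)
        T_rel[:3, 3] = [tx, ty, tz]
        T = T @ T_rel                 # accumulate motion frame by frame
        trajectory.append(T.copy())
    return trajectory
```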
  • the subject of the pose estimation method may be a pose estimation device, for example, the pose estimation method may be executed by a positioning device or a server or other processing device, wherein the positioning device may be a robot, an unmanned vehicle , drones and other mobile devices, it can also be user equipment (User Equipment, UE), user terminal, cordless phone, personal digital assistant (Personal Digital Assistant, PDA), handheld device, computing device, vehicle-mounted device, wearable device, etc. .
  • the pose estimation method may be implemented by a processor invoking computer-readable instructions stored in a memory.
  • FIG. 12 is a schematic frame diagram of an embodiment of a training device for a visual-inertial odometry model of the present disclosure.
  • The training device 120 of the visual-inertial odometry model includes: a sample acquisition part 1200, configured to acquire a sample image set and a sample IMU data set, where the sample image set includes several frames of continuous sample color images acquired by an image acquisition device and the sample IMU data set includes the corresponding sample IMU data obtained while those continuous sample color images are acquired; a processing part 1202, configured to input two adjacent frames of sample color images in the sample image set and the corresponding sample IMU data between the two adjacent frames of sample color images into the visual-inertial odometry model, and to output the two depth images corresponding to the two adjacent frames of sample color images and the estimated pose at which the image acquisition device acquires the two adjacent frames of sample color images; and a loss function determination part 1204, configured to determine the target loss function of the visual-inertial odometry model based on the two depth images corresponding to the two adjacent frames of sample color images, the estimated pose at which the image acquisition device acquires the two adjacent frames of sample color images, and the corresponding sample IMU data between the two adjacent frames of sample color images.
  • In this way, the sample image set and the sample IMU data set are acquired through the sample acquisition part 1200, where the sample image set includes several frames of continuous sample color images acquired by an image acquisition device and the sample IMU data set includes the corresponding sample IMU data obtained while those continuous sample color images are acquired. The processing part 1202 inputs the sample image set and the sample IMU data set into the visual-inertial odometry model and outputs the two depth images corresponding to two adjacent frames of sample color images and the estimated pose at which the image acquisition device acquires the two adjacent frames of sample color images. The loss function determination part 1204 can then determine the target loss function of the visual-inertial odometry model based on these two depth images, the estimated pose, and all the sample IMU data corresponding to the interval between the two adjacent frames of sample color images. Visual information and IMU information are therefore fused inside the network, and the respective advantages of the two can be exploited.
  • In some embodiments, the visual-inertial odometry model includes a depth estimation network, a visual coding network, an IMU coding network, and a visual-inertial fusion network. The processing part 1202 is further configured to: input the sample color images in the sample image set into the depth estimation network to obtain the depth images corresponding to the sample color images; superimpose the previous frame sample color image and the current frame sample color image in the sample image set and input them into the visual coding network to obtain a visual feature code; input the corresponding sample IMU data between the previous frame sample color image and the current frame sample color image into the IMU coding network to obtain an IMU feature code; and input the visual feature code and the IMU feature code into the visual-inertial fusion network to obtain the estimated pose at which the image acquisition device acquires the current frame sample color image.
  • In some embodiments, the depth estimation network includes an encoder and a decoder connected to each other. The processing part 1202 is further configured to input the sample color image into the depth estimation network, transform the sample color image into a depth feature map with the downsampling layers of the encoder, and then transform the depth feature map into the depth image corresponding to the sample color image with the upsampling layers of the decoder.
  • In some embodiments, the visual-inertial fusion network adopts an attention mechanism and includes a feed-forward neural network. The processing part 1202 is further configured to perform weighted fusion of the visual feature code and the IMU feature code through the attention mechanism to obtain an optimized feature code, and to process the optimized feature code with the feed-forward neural network to obtain the estimated pose at which the image acquisition device acquires the current frame sample color image.
  • In some embodiments, the visual-inertial fusion network further includes a first multi-layer perceptron and a second multi-layer perceptron. The processing part 1202 is further configured to: input the IMU feature code into the first multi-layer perceptron and the second multi-layer perceptron respectively to obtain several key-value pairs, each of which includes a key and a value; obtain the similarity between the visual feature code and the key in each key-value pair; and use each similarity as a weight, multiply the weight by the value in the corresponding key-value pair, and sum the results to obtain the optimized feature code.
  • In some embodiments, the target loss function includes a depth loss function, a photometric loss function, and an IMU loss function. The loss function determination part 1204 is further configured to: determine the depth loss function according to the depth image corresponding to the previous frame sample color image and the depth image corresponding to the current frame sample color image; determine the photometric loss function according to the estimated pose at which the image acquisition device acquires the current frame sample color image and the depth image corresponding to the current frame sample color image; and determine the IMU loss function according to the estimated pose at which the image acquisition device acquires the current frame sample color image and the corresponding sample IMU data between the previous frame sample color image and the current frame sample color image.
  • FIG. 13 is a schematic frame diagram of an embodiment of a pose estimation device of the present disclosure.
  • The pose estimation device 130 includes: a data acquisition part 1300, configured to use an image acquisition device to acquire several frames of continuous target color images and to determine the corresponding target IMU data obtained while the image acquisition device acquires the several frames of continuous target color images; and a pose estimation part 1302, configured to input the several frames of continuous target color images and the corresponding target IMU data into the visual-inertial odometry model to obtain the estimated pose at which the image acquisition device acquires the target color images, where the visual-inertial odometry model is obtained by any of the above-mentioned training methods of the visual-inertial odometry model.
  • In this way, the data acquisition part 1300 uses the image acquisition device to acquire several frames of continuous target color images and determines the corresponding target IMU data obtained while the image acquisition device acquires them; the pose estimation part 1302 inputs the target color images and the corresponding target IMU data into the visual-inertial odometry model to obtain the estimated pose at which the image acquisition device acquires the target color images. Since the visual-inertial odometry model is trained by the training method of the visual-inertial odometry model in the first aspect above, it fuses visual information and IMU information inside the network and exploits the respective advantages of the two, giving more accurate and more robust pose estimation results.
  • FIG. 14 is a schematic frame diagram of an embodiment of an electronic device of the present disclosure.
  • the electronic device 140 includes a memory 141 and a processor 142 coupled to each other, and the processor 142 is used to execute the program instructions stored in the memory 141, so as to realize the training method of any of the above-mentioned visual-inertial odometry models, or any of the above-mentioned pose estimation method.
  • the electronic device 140 may include, but not limited to: a microcomputer and a server.
  • the processor 142 is used to control itself and the memory 141 to implement any of the above-mentioned training methods of the visual-inertial odometry model, or the steps in any of the above-mentioned embodiments of the pose estimation method.
  • the processor 142 may also be called a CPU (Central Processing Unit, central processing unit).
  • the processor 142 may be an integrated circuit chip with signal processing capability.
  • the processor 142 can also be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other Programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the processor 142 may be jointly realized by an integrated circuit chip.
  • the processor 142 obtains a sample image set and a sample IMU data set, wherein the sample image set includes several frames of continuous sample color images acquired by an image acquisition device, and the sample IMU data set includes several frames of continuous sample color images acquired.
  • the corresponding sample IMU data obtained at the time after inputting the sample image set and the sample IMU data set into the visual-inertial odometry model, the visual-inertial odometry model can be used to estimate the scene depth and the pose of the image acquisition device, and output the adjacent
  • the two frames of depth images corresponding to the two frames of sample color images and the estimated poses when the image acquisition device acquires two adjacent frames of sample color images can be based on the two frames of depth images corresponding to the adjacent two frames of sample color images, image acquisition The device acquires the estimated poses of two adjacent frames of sample color images and all the sample IMU data corresponding between two adjacent frames of sample color images to determine the target loss function of the visual-inertial odometry model.
  • in this way, the visual information and the IMU information are fused in the network, and by using the respective advantages of the two, a more accurate and robust visual-inertial odometry model can be obtained. In addition, the visual-inertial odometry model is implemented with a deep learning framework; compared with the traditional nonlinear method based on BA (Bundle Adjustment), it requires no complex initialization and iteration process, so the model is more concise. This solves the problem of complex initialization and optimization in traditional BA-based nonlinear optimization algorithms and avoids track loss in complex scenes.
  • FIG. 15 is a schematic framework diagram of an embodiment of a computer-readable storage medium of the present disclosure.
  • the computer-readable storage medium 150 stores program instructions 1500 that can be executed by the processor, and the program instructions 1500 are used to implement any of the above-mentioned training methods for the visual-inertial odometry model, or the steps in any of the above-mentioned pose estimation method embodiments.
  • the disclosed methods and devices may be implemented in other ways.
  • the device implementations described above are only illustrative.
  • the division of modules or units is only a division by logical function; in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling, direct coupling, or communication connections shown or discussed may be implemented through some interfaces; the indirect coupling or communication connections between devices or units may be in electrical, mechanical, or other forms.
  • a unit described as a separate component may or may not be physically separated, and a component shown as a unit may or may not be a physical unit; that is, it may be located in one place or distributed over multiple network units. Part or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • if the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods in the various embodiments of the present disclosure.
  • the aforementioned storage media include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
  • in the above solution, the network structure of the visual-inertial odometry is more concise because a deep learning framework is used; the depth images and estimated poses of two adjacent frames of sample color images, together with the sample IMU data corresponding to the interval between them, are used to determine the target loss function of the visual-inertial odometry model, and the model can be trained according to that target loss function; visual information and IMU information are thus fused in the network and the respective advantages of the two are combined, which simplifies the initialization and optimization process while improving the accuracy and robustness of the visual-inertial odometry.
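To make the pose estimation flow above concrete, the following is a minimal sketch in Python/PyTorch. The `VIOModel` class, its layer sizes, the tensor shapes, and the 6-DoF pose parameterization are illustrative assumptions, not the network architecture actually disclosed in this application; the sketch only shows the interface implied by the description (two adjacent color frames plus the IMU samples recorded between them go in, a depth map and a relative pose come out).

```python
# Minimal inference sketch (illustrative only; VIOModel and its interface are
# assumptions, not the network disclosed in this application).
import torch
import torch.nn as nn


class VIOModel(nn.Module):
    """Toy stand-in for a visual-inertial odometry model: it fuses two RGB
    frames with the IMU samples recorded between them and predicts a depth map
    for the first frame plus a 6-DoF relative pose (3 translation + 3 rotation)."""

    def __init__(self, imu_dim: int = 6):
        super().__init__()
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(6, 16, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.imu_encoder = nn.GRU(imu_dim, 32, batch_first=True)
        self.pose_head = nn.Linear(32 + 32, 6)
        self.depth_head = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Softplus(),  # keep depth positive
        )

    def forward(self, img_prev, img_curr, imu_seq):
        vis = self.visual_encoder(torch.cat([img_prev, img_curr], dim=1)).flatten(1)
        _, imu_hidden = self.imu_encoder(imu_seq)
        fused = torch.cat([vis, imu_hidden[-1]], dim=1)  # fuse visual + inertial
        rel_pose = self.pose_head(fused)                 # (B, 6) relative pose
        depth = self.depth_head(img_prev)                # (B, 1, H, W) depth map
        return depth, rel_pose


# Usage: two adjacent color frames and the IMU samples recorded between them.
model = VIOModel().eval()
img_prev = torch.rand(1, 3, 128, 416)
img_curr = torch.rand(1, 3, 128, 416)
imu_seq = torch.rand(1, 11, 6)  # e.g. 11 accel/gyro samples between the frames
with torch.no_grad():
    depth, rel_pose = model(img_prev, img_curr, imu_seq)
print(depth.shape, rel_pose.shape)  # (1, 1, 128, 416) and (1, 6)
```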
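For the target loss mentioned above, one plausible construction, common in self-supervised visual-inertial odometry work, combines a photometric reprojection term (warping one frame into the other view using the predicted depth and relative pose) with an IMU-consistency term (comparing the predicted relative pose against a pre-integrated IMU motion estimate). The sketch below assumes pinhole intrinsics `K`, an Euler-angle pose parameterization, and a pre-computed IMU delta `imu_delta`; the exact loss terms and weights of this application are not reproduced here.

```python
# One plausible target-loss construction (illustrative assumptions throughout;
# not the loss actually claimed in this application).
import torch
import torch.nn.functional as F


def euler_to_matrix(angles):
    """Rotation matrix from (B, 3) Euler angles (XYZ convention; an assumption)."""
    rx, ry, rz = angles[:, 0], angles[:, 1], angles[:, 2]
    zeros, ones = torch.zeros_like(rx), torch.ones_like(rx)
    Rx = torch.stack([ones, zeros, zeros,
                      zeros, rx.cos(), -rx.sin(),
                      zeros, rx.sin(), rx.cos()], dim=1).view(-1, 3, 3)
    Ry = torch.stack([ry.cos(), zeros, ry.sin(),
                      zeros, ones, zeros,
                      -ry.sin(), zeros, ry.cos()], dim=1).view(-1, 3, 3)
    Rz = torch.stack([rz.cos(), -rz.sin(), zeros,
                      rz.sin(), rz.cos(), zeros,
                      zeros, zeros, ones], dim=1).view(-1, 3, 3)
    return Rz @ Ry @ Rx


def photometric_loss(img_prev, img_curr, depth_prev, rel_pose, K):
    """Warp img_curr into the previous view with the predicted depth and
    relative pose, then compare with img_prev (reprojection consistency)."""
    B, _, H, W = img_prev.shape
    R = euler_to_matrix(rel_pose[:, 3:])                  # (B, 3, 3)
    t = rel_pose[:, :3].unsqueeze(-1)                     # (B, 3, 1)
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()  # (3, H, W)
    pix = pix.view(1, 3, -1).expand(B, -1, -1)            # (B, 3, H*W)
    cam = torch.inverse(K) @ pix * depth_prev.reshape(B, 1, -1)  # back-project
    cam2 = R @ cam + t                                    # move into the second view
    proj = K @ cam2
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)       # (B, 2, H*W)
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0                    # normalize for grid_sample
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    warped = F.grid_sample(img_curr, grid, align_corners=True)
    return (warped - img_prev).abs().mean()


def imu_consistency_loss(rel_pose, imu_delta):
    """Penalize disagreement between the network pose and a (B, 6)
    pre-integrated IMU motion estimate."""
    return (rel_pose - imu_delta).abs().mean()


def target_loss(img_prev, img_curr, depth_prev, rel_pose, imu_delta, K,
                w_photo=1.0, w_imu=0.1):
    # Weights w_photo / w_imu are placeholders, not values from this application.
    return (w_photo * photometric_loss(img_prev, img_curr, depth_prev, rel_pose, K)
            + w_imu * imu_consistency_loss(rel_pose, imu_delta))
```

With the tensors from the previous sketch, a toy call such as `target_loss(img_prev, img_curr, depth, rel_pose, torch.zeros(1, 6), torch.eye(3))` returns a scalar that can be backpropagated to adjust the network parameters, which is the role the target loss function plays in the training method described above.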

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a training method for a visual-inertial odometry model, a pose estimation method and apparatuses, an electronic device, a computer-readable storage medium, and a program product. The training method for the visual-inertial odometry model comprises: inputting two adjacent frames of sample color images from a sample image set, together with the sample IMU data corresponding to the two adjacent frames of sample color images, into a visual-inertial odometry model, and outputting the two depth images corresponding to the two adjacent frames of sample color images and the estimated pose of an image acquisition device when acquiring the two adjacent frames of sample color images; determining a target loss function of the visual-inertial odometry model on the basis of the two depth images corresponding to the two adjacent frames of sample color images, the estimated pose of the image acquisition device when acquiring the two adjacent frames of sample color images, and the corresponding sample IMU data between the two adjacent frames of sample color images; and adjusting network parameters of the visual-inertial odometry model by using the target loss function. According to the above solution, an optimal pose estimation result can be obtained.
PCT/CN2022/112430 2022-03-01 2022-08-15 Procédé d'entraînement pour modèle d'odomètre inertiel visuel, procédé et appareils d'estimation de posture, dispositif électronique, support de stockage lisible par ordinateur et produit de programme WO2023165093A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210195781.9A CN114612556A (zh) 2022-03-01 2022-03-01 视觉惯性里程计模型的训练方法、位姿估计方法及装置
CN202210195781.9 2022-03-01

Publications (1)

Publication Number Publication Date
WO2023165093A1 true WO2023165093A1 (fr) 2023-09-07

Family

ID=81861781

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/112430 WO2023165093A1 (fr) 2022-03-01 2022-08-15 Procédé d'entraînement pour modèle d'odomètre inertiel visuel, procédé et appareils d'estimation de posture, dispositif électronique, support de stockage lisible par ordinateur et produit de programme

Country Status (2)

Country Link
CN (1) CN114612556A (fr)
WO (1) WO2023165093A1 (fr)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114612556A (zh) * 2022-03-01 2022-06-10 北京市商汤科技开发有限公司 视觉惯性里程计模型的训练方法、位姿估计方法及装置
CN115435790A (zh) * 2022-09-06 2022-12-06 视辰信息科技(上海)有限公司 一种视觉定位与视觉里程计位姿融合的方法及系统
CN115358962B (zh) * 2022-10-18 2023-01-10 中国第一汽车股份有限公司 一种端到端视觉里程计方法及装置
CN116681759B (zh) * 2023-04-19 2024-02-23 中国科学院上海微系统与信息技术研究所 一种基于自监督视觉惯性里程计的相机位姿估计方法
CN116704026A (zh) * 2023-05-24 2023-09-05 国网江苏省电力有限公司南京供电分公司 一种定位方法、装置、电子设备和存储介质
CN117058474B (zh) * 2023-10-12 2024-01-12 南昌航空大学 一种基于多传感器融合的深度估计方法及系统

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180031387A1 (en) * 2016-07-29 2018-02-01 Carnegie Mellon University State estimation for aerial vehicles using multi-sensor fusion
CN111311685A (zh) * 2020-05-12 2020-06-19 中国人民解放军国防科技大学 一种基于imu/单目图像的运动场景重构无监督方法
CN111369608A (zh) * 2020-05-29 2020-07-03 南京晓庄学院 一种基于图像深度估计的视觉里程计方法
US20220036577A1 (en) * 2020-07-30 2022-02-03 Apical Limited Estimating camera pose
CN112348854A (zh) * 2020-11-18 2021-02-09 合肥湛达智能科技有限公司 一种基于深度学习视觉惯性里程检测方法
CN112556692A (zh) * 2020-11-27 2021-03-26 绍兴市北大信息技术科创中心 一种基于注意力机制的视觉和惯性里程计方法和系统
CN112729294A (zh) * 2021-04-02 2021-04-30 北京科技大学 适用于机器人的视觉和惯性融合的位姿估计方法及系统
CN113091738A (zh) * 2021-04-09 2021-07-09 安徽工程大学 基于视觉惯导融合的移动机器人地图构建方法及相关设备
CN113221726A (zh) * 2021-05-08 2021-08-06 天津大学 一种基于视觉与惯性信息融合的手部姿态估计方法及系统
CN114612556A (zh) * 2022-03-01 2022-06-10 北京市商汤科技开发有限公司 视觉惯性里程计模型的训练方法、位姿估计方法及装置

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117197229A (zh) * 2023-09-22 2023-12-08 北京科技大学顺德创新学院 一种基于亮度对齐的多阶段估计单目视觉里程计方法
CN117197229B (zh) * 2023-09-22 2024-04-19 北京科技大学顺德创新学院 一种基于亮度对齐的多阶段估计单目视觉里程计方法

Also Published As

Publication number Publication date
CN114612556A (zh) 2022-06-10

Similar Documents

Publication Publication Date Title
WO2023165093A1 (fr) Procédé d'entraînement pour modèle d'odomètre inertiel visuel, procédé et appareils d'estimation de posture, dispositif électronique, support de stockage lisible par ordinateur et produit de programme
WO2020140431A1 (fr) Procédé et appareil de détermination de pose de caméra, dispositif électronique et support de stockage
CN110595466B (zh) 轻量级的基于深度学习的惯性辅助视觉里程计实现方法
CN111902826A (zh) 定位、建图和网络训练
WO2022206020A1 (fr) Procédé et appareil d'estimation de profondeur de champ d'image, et dispositif terminal et support de stockage
CN110533724B (zh) 基于深度学习和注意力机制的单目视觉里程计的计算方法
CN112907620B (zh) 相机位姿的估计方法、装置、可读存储介质及电子设备
CN113487608B (zh) 内窥镜图像检测方法、装置、存储介质及电子设备
CN111080699B (zh) 基于深度学习的单目视觉里程计方法及系统
CN112258565B (zh) 图像处理方法以及装置
JP2020008984A (ja) 自己位置推定装置、自己位置推定方法、自己位置推定プログラム、学習装置、学習方法及び学習プログラム
CN112991400B (zh) 一种无人艇的多传感器辅助定位方法
WO2023109221A1 (fr) Procédé et appareil de détermination de matrice d'homographie, support, dispositif, et produit-programme
US11398048B2 (en) Estimating camera pose
CN112907557A (zh) 道路检测方法、装置、计算设备及存储介质
WO2023140990A1 (fr) Odométrie inertielle visuelle avec profondeur d'apprentissage automatique
CN118247706A (zh) 基于微调标准模型的三维姿态估计方法、装置及存储介质
CN114140538B (zh) 车载相机位姿调整方法、装置、设备和计算机可读介质
Hu et al. Real-time camera localization with deep learning and sensor fusion
CN115435790A (zh) 一种视觉定位与视觉里程计位姿融合的方法及系统
JP6260533B2 (ja) 位置姿勢推定装置、位置姿勢推定方法および位置姿勢推定プログラム
CN113205530A (zh) 阴影区域处理方法及装置、计算机可读介质和电子设备
CN114659520B (zh) 位姿确定方法、位姿确定装置、介质与电子设备
CN117351306B (zh) 三维点云投影位姿求解器训练方法、确定方法及装置
CN113628279B (zh) 一种全景视觉slam建图方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22929525

Country of ref document: EP

Kind code of ref document: A1