CN116758161A - Mobile terminal space data generation method and space perception mobile terminal - Google Patents

Mobile terminal space data generation method and space perception mobile terminal

Info

Publication number
CN116758161A
Authority
CN
China
Prior art keywords
pose information
frames
image
mobile terminal
time sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310755542.9A
Other languages
Chinese (zh)
Inventor
高健
任轶
刘明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Daoyi Shuhui Technology Co ltd
Original Assignee
Beijing Daoyi Shuhui Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Daoyi Shuhui Technology Co ltd filed Critical Beijing Daoyi Shuhui Technology Co ltd
Priority to CN202310755542.9A priority Critical patent/CN116758161A/en
Publication of CN116758161A publication Critical patent/CN116758161A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/207 Analysis of motion for motion estimation over a hierarchy of resolutions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/269 Analysis of motion using gradient-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30244 Camera pose

Abstract

The application discloses a mobile terminal space data generation method and a space perception mobile terminal, which are used for solving the technical problem of high implementation cost caused by the high computational complexity of spatial perception. The method determines key frames in a video stream and sets a marginalized sliding window and a local sliding window, greatly reducing redundant image frames and therefore the computational complexity. As a result, a mobile terminal equipped with only a monocular photography module, an inertial sensing module and a positioning module can also achieve spatial perception capability, which reduces the implementation cost and unlocks the potential of converting spatial perception technology into productivity.

Description

Mobile terminal space data generation method and space perception mobile terminal
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a mobile terminal space data generation method and a space perception mobile terminal.
Background
Spatial perception refers to the active process by which an agent learns about its surrounding environment and its own relationship to that environment through various sensing capabilities. Machine equipment mainly relies on sensing devices such as cameras, inertial measurement units and satellite positioning modules to achieve spatial perception. Through these technical means, real-world entities such as people and objects are converted into spatial data that can be computed and used in a digital space.
Spatial perception capability has rich application scenarios: autonomous driving, various robots, AR/VR devices and the like are all based on it. In recent years, with the rapid development of many new technologies, the application scenarios of spatial perception capability have become more and more important.
In implementing the prior art, the inventors found that:
the adoption of spatial perception technology is limited. The reason is that spatial perception has high computational complexity, which places high demands on device performance, so the technology depends on dedicated devices serving dedicated scenarios. These devices are expensive, proprietary and complex to assemble, which severely constrains the potential of converting spatial perception technology into productivity.
Therefore, a new mobile terminal spatial data generation scheme is needed to solve the technical problem of high implementation cost caused by high complexity of spatial perception calculation.
Disclosure of Invention
The embodiment of the application provides a new mobile terminal space data generation scheme, which is used for solving the technical problem of high implementation cost caused by high space perception calculation complexity.
Specifically, the method for generating the space data of the mobile terminal is applied to the mobile terminal with an inertial sensing module, a monocular photography module and a positioning module, and comprises the following steps:
Based on an inertial sensing module of the mobile terminal, acquiring initial inertial pose information corresponding to a time sequence;
pre-integrating the initial inertial pose to obtain optimized inertial pose information;
based on a monocular photography module of a mobile terminal, acquiring a video stream corresponding to a time sequence;
adopting a sparse optical flow algorithm to determine the image characteristics of adjacent frames in the video stream;
generating initial visual pose information by adopting a motion structure recovery algorithm according to image characteristics of adjacent frames in a video stream;
loosely coupling the initial visual pose information and the optimized inertial pose information to obtain a calibration parameter value;
determining key frames in the video stream according to a key frame strategy;
according to the calibration parameter values and the key frames, a marginalized sliding window is configured so as to reduce the calculation complexity;
based on the marginalized sliding window, calculating accumulated integral residual errors between adjacent frame images;
based on the accumulated integral residual error, performing tightly coupled nonlinear optimization on the initial visual pose information and the optimized inertial pose information to generate estimated pose information;
configuring a local sliding window according to the calibration parameter value, the feature descriptor extraction algorithm and the local closed-loop constraint;
based on the local sliding window, carrying out local loop detection on the estimated pose information to generate local estimated pose information;
Based on a positioning module of the mobile terminal, acquiring global track information corresponding to the time sequence;
establishing an association relation between global track information and local estimated pose information according to the time sequence, and generating sparse point cloud with position attribute;
inputting a video stream to a monocular depth estimation model to obtain image depth information;
and updating the sparse point cloud into a dense point cloud according to the image depth information, and taking the dense point cloud as space data.
Further, the determining the key frame in the video stream according to the key frame policy specifically includes:
tracking image features of adjacent frames in the video stream by adopting a sparse optical flow algorithm;
when the number of tracked image feature points between adjacent frames in the video stream is smaller than a first preset threshold, taking the later image frame of the adjacent frames in the time sequence as a first key frame;
calculating the average parallax between image frames later in the time sequence than the first key frame and the first key frame;
and when the average parallax is larger than a second preset threshold value, determining the corresponding image frame in the time sequence as a second key frame.
Further, the marginalized sliding window is configured to:
recording visual pose information corresponding to a first number of key frames;
acquiring the image frame following the key frame in the time sequence;
when the subsequent image frame is a new key frame, recording visual pose information corresponding to the new key frame, and marginalizing the earliest key frame in the time sequence together with the visual pose information corresponding to that key frame;
when the subsequent image frame is a non-key frame, recording the optimized inertial pose information corresponding to that non-key frame.
Further, the local sliding window is configured to:
recording key frames meeting local closed loop constraint;
calculating the similarity between the key frame and its adjacent frame by adopting a feature descriptor extraction algorithm;
and deleting the key frames with the similarity with the adjacent frames exceeding a third preset threshold value.
Further, the mobile terminal comprises at least one mobile terminal of a smart phone, a smart watch or a portable computer.
The embodiment of the application also provides a space perception mobile terminal.
Specifically, a space-aware mobile terminal includes:
the inertial sensing module is used for acquiring initial inertial pose information corresponding to the time sequence;
the monocular photography module is used for collecting video streams corresponding to the time sequence;
the positioning module is used for acquiring global track information corresponding to the time sequence;
the space data generation module is used for pre-integrating the initial inertial pose to obtain optimized inertial pose information; it is also used for determining the image features of adjacent frames in the video stream by adopting a sparse optical flow algorithm; it is also used for generating initial visual pose information by adopting a motion structure recovery algorithm according to the image features of adjacent frames in the video stream; it is also used for loosely coupling the initial visual pose information and the optimized inertial pose information to obtain calibration parameter values; it is also used for determining key frames in the video stream according to a key frame strategy; it is also used for configuring a marginalized sliding window according to the calibration parameter value and the key frame so as to reduce the calculation complexity; it is also used for calculating accumulated integral residual errors between adjacent frame images based on the marginalized sliding window; it is also used for carrying out tightly coupled nonlinear optimization on the initial visual pose information and the optimized inertial pose information based on the accumulated integral residual error to generate estimated pose information; it is also used for configuring a local sliding window according to the calibration parameter value, the feature descriptor extraction algorithm and the local closed-loop constraint; it is also used for carrying out local loop detection on the estimated pose information based on the local sliding window to generate local estimated pose information; it is also used for establishing an association relation between global track information and local estimated pose information according to the time sequence, and generating a sparse point cloud with position attribute; it is also used for inputting a video stream to the monocular depth estimation model to obtain image depth information; and it is also used for updating the sparse point cloud into a dense point cloud according to the image depth information, and taking the dense point cloud as space data.
Further, the spatial data generating module is configured to determine a key frame in the video stream according to a key frame policy, and specifically configured to:
tracking image features of adjacent frames in the video stream by adopting a sparse optical flow algorithm;
when the number of tracked image feature points between adjacent frames in the video stream is smaller than a first preset threshold, taking the later image frame of the adjacent frames in the time sequence as a first key frame;
calculating the average parallax between image frames later in the time sequence than the first key frame and the first key frame;
and when the average parallax is larger than a second preset threshold value, determining the corresponding image frame in the time sequence as a second key frame.
Further, the marginalized sliding window is configured to:
recording visual pose information corresponding to a first number of key frames;
acquiring the image frame following the key frame in the time sequence;
when the subsequent image frame is a new key frame, recording visual pose information corresponding to the new key frame, and marginalizing the earliest key frame in the time sequence together with the visual pose information corresponding to that key frame;
when the subsequent image frame is a non-key frame, recording the optimized inertial pose information corresponding to that non-key frame.
Further, the local sliding window is configured to:
Recording key frames meeting local closed loop constraint;
calculating the similarity between the key frame and its adjacent frame by adopting a feature descriptor extraction algorithm;
and deleting the key frames with the similarity with the adjacent frames exceeding a third preset threshold value.
Further, the space perception mobile terminal comprises at least one mobile terminal of a smart phone, a smart watch or a portable computer.
The technical scheme provided by the embodiment of the application has at least the following beneficial effects:
by determining key frames in the video stream and setting a marginalized sliding window and a local sliding window, redundant image frames are greatly reduced, thereby reducing the computational complexity. As a result, a mobile terminal equipped with only a monocular photography module, an inertial sensing module and a positioning module can also achieve spatial perception capability, which reduces the implementation cost and unlocks the potential of converting spatial perception technology into productivity.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
fig. 1 is a flow chart of a method for generating mobile terminal spatial data according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a spatial aware mobile terminal according to an embodiment of the present application.
The reference numerals in the drawings are as follows:
100. space perception mobile terminal
11. Monocular photography module
12. Inertial sensing module
13. Positioning module
14. Spatial data generation module
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to fig. 1, in order to solve the technical problem of high implementation cost caused by high complexity of space perception calculation, the application provides a method for generating space data of a mobile terminal, which is applied to the mobile terminal with a monocular photography module, an inertial sensing module and a positioning module. It can be understood that the mobile terminal with the monocular photography module, the inertial sensing module and the positioning module is at least one mobile terminal of a smart phone, a smart watch or a portable computer in a specific application scene.
The inventors found that, as the production technology of mobile terminals matures, a monocular photography module, an inertial sensing module and a positioning module are usually provided on consumer-level hardware devices, so the measurement data required for spatial perception can be obtained. In other words, if spatial perception capability is given to consumer-level hardware devices, the implementation cost of spatial perception can be reduced, unlocking the potential of converting spatial perception technology into productivity.
Specifically, the method for generating the mobile terminal space data comprises the following steps:
s110: based on an inertial sensing module of the mobile terminal, initial inertial pose information corresponding to the time sequence is acquired.
S120: and pre-integrating the initial inertial pose to obtain optimized inertial pose information.
It is understood that the inertial sensing module may be embodied as a gyroscope, an acceleration sensor, etc. in a specific application scenario. Generally, the inertial sensing module can directly measure the angular speed and acceleration of the mobile terminal when moving as initial inertial pose information, and the response speed is high. Of course, in a preferred embodiment provided by the application, the angular velocity and the acceleration of the mobile terminal during movement are recorded according to the time stamp, and the angular velocity and the acceleration are arranged in time series to form the initial inertial pose information.
It should be noted that, because the inertial sensing module of the mobile terminal of the present application is difficult to guarantee high accuracy, its measurement of the initial inertial pose information is easily affected by bias and noise, which causes the estimated pose to diverge.
Therefore, pre-integration is also needed for the initial inertial pose to obtain optimized inertial pose information.
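For illustration only, the following is a minimal numerical sketch of the kind of pre-integration described above, assuming timestamped raw gyroscope and accelerometer samples. The simple Euler integration, the fixed bias inputs and the omission of covariance propagation are simplifications made here for brevity; they are not the exact pre-integration scheme of the application.

```python
import numpy as np

def skew(w):
    """Skew-symmetric matrix of a 3-vector, used for the rotation increment."""
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

def preintegrate(stamps, gyro, accel, gyro_bias, accel_bias):
    """Accumulate rotation, velocity and position increments between two
    timestamps, expressed relative to the frame at the first timestamp.
    stamps: (N,) seconds; gyro, accel: (N, 3) raw IMU samples."""
    dR = np.eye(3)                 # accumulated rotation increment
    dv = np.zeros(3)               # accumulated velocity increment
    dp = np.zeros(3)               # accumulated position increment
    for k in range(len(stamps) - 1):
        dt = stamps[k + 1] - stamps[k]
        w = gyro[k] - gyro_bias    # bias-corrected angular rate
        a = accel[k] - accel_bias  # bias-corrected acceleration
        dp = dp + dv * dt + 0.5 * (dR @ a) * dt * dt
        dv = dv + (dR @ a) * dt
        # first-order rotation update; a full implementation would use the
        # exponential map and also propagate covariance and bias Jacobians
        dR = dR @ (np.eye(3) + skew(w) * dt)
    return dR, dv, dp
```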
It should be further noted that, whether the inertial pose information is initial or optimized, the coordinate system corresponding to the first timestamp is taken as the current coordinate system, and the pose thus estimated is the pose relative to the coordinate system corresponding to the first timestamp, and not the pose relative to the world coordinate system.
It should also be emphasized that, even after the initial inertial pose is pre-integrated, the resulting optimized inertial pose information still exhibits significant drift, and the drift accumulates as the time sequence grows. Using only the optimized inertial pose information as the basis for spatial perception is therefore highly unreliable. For example, even if the mobile terminal is held in a fixed position, the optimized inertial pose information obtained after pre-integrating the initial inertial pose information still drifts.
Therefore, the inventor introduces visual pose information which does not drift to align with the optimized inertial pose information so as to overcome the defect that the optimized inertial pose information has drift.
S130: based on a monocular photography module of the mobile terminal, video streams corresponding to the time sequence are collected.
S140: and determining the image characteristics of adjacent frames in the video stream by adopting a sparse optical flow algorithm.
S150: and generating initial visual pose information by adopting a motion structure recovery algorithm according to the image characteristics of adjacent frames in the video stream.
It can be appreciated that the monocular photography module appears as a monocular lens in a specific application scenario. The video stream is made up of a number of image frames. Typically, the image frames have a time stamp. And arranging a plurality of image frames in a time sequence according to the time stamp corresponding to the image frame, namely forming the video stream.
Optical flow is the instantaneous velocity of the pixel motion of a spatially moving object on the viewing imaging plane. The sparse optical flow algorithm is an algorithm for finding out the corresponding relation between the previous frame and the current frame by utilizing the change of pixels in an image sequence in a time domain and the correlation between adjacent frames, so as to calculate the motion information of an object between the adjacent frames.
The method for determining the image characteristics of the adjacent frames in the video stream by adopting the sparse optical flow algorithm specifically comprises the following steps:
image features of each image frame are tracked by adopting a sparse optical flow algorithm, so as to maintain a minimum of 100 to 300 corner features in each image frame. At the same time, a minimum pixel spacing between two adjacent features is set to enforce a uniform feature distribution.
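A minimal sketch of this feature-tracking step, assuming OpenCV's pyramidal Lucas-Kanade sparse optical flow, is given below. The specific corner count (150) and minimum pixel spacing (30) are illustrative values consistent with, but not prescribed by, the description above.

```python
import cv2

def track_adjacent_frames(prev_gray, curr_gray, prev_pts=None,
                          max_corners=150, min_distance=30):
    """Track corner features from prev_gray to curr_gray with sparse optical flow.
    min_distance enforces a minimum pixel spacing so features stay spread out."""
    if prev_pts is None or len(prev_pts) < 100:
        # (re)detect corners so each frame keeps roughly 100-300 features
        prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                           qualityLevel=0.01,
                                           minDistance=min_distance)
    curr_pts, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray,
                                                      prev_pts, None)
    ok = status.ravel() == 1
    # matched feature pairs of the two adjacent frames
    return prev_pts[ok], curr_pts[ok]
```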
The motion structure restoration algorithm is represented as a Structure From Motion (SFM) algorithm in a specific application scene and is used for extracting visual pose information from image features of adjacent frames in a video stream.
In a specific application scenario provided by the application, the implementation process of generating the initial visual pose information by adopting a motion structure recovery algorithm according to the image characteristics of adjacent frames in the video stream is as follows:
and performing epipolar constraint and triangulation according to the image characteristics of adjacent frames in the video stream, and recovering the three-dimensional information of the characteristic objects.
It is worth noting that the visual pose information does not substantially drift.
However, when the image content changes, it is essentially unknown whether the mobile terminal itself has moved or the external conditions have changed, so dynamic obstacles are difficult to handle with purely visual pose information. Meanwhile, since the accuracy of triangulation based on visual feature points is related to the inter-frame displacement, triangulation may degenerate when the monocular photography module undergoes nearly pure rotational motion, which may lead to loss of image feature tracking. Moreover, the motion structure recovery algorithm typically takes the first frame as the current coordinate system, so the estimated pose is relative to the first frame image rather than to the world coordinate system.
In summary, the depth in the initial visual pose information is not at the actual physical scale, so the feature objects in the initial visual pose information have the correct shape but not the actual size.
S160: and performing loose coupling on the initial visual pose information and the optimized inertial pose information to obtain calibration parameter values.
It should be pointed out again that the initial visual pose information is not subject to drift but is affected by dynamic obstacles, and the optimized inertial pose information is not affected by dynamic obstacles but is subject to accumulated drift. Intuitively, the two have good complementarity. Meanwhile, the coordinate system corresponding to the initial visual pose information and the coordinate system corresponding to the optimized inertial pose information are not world coordinate systems.
Therefore, loose coupling needs to be performed on the initial visual pose information and the optimized inertial pose information to obtain calibration parameter values and complete the initialization. Loose coupling fuses the initial visual pose information and the optimized inertial pose information directly; the fusion process does not modify either input, and the two are typically fused by an EKF and output as a post-processing step. Specifically, initial visual pose information and optimized inertial pose information under the same time stamp are obtained with a calibration board to align the pose sequences, and calibration parameter values such as the rotation, displacement and time delay between the coordinate systems are calculated. The optimized inertial pose information can also predict the position and attitude of the next image frame, as well as where the image features of the previous moment will appear in the next frame, which improves the matching speed of the feature tracking algorithm and its robustness to fast rotation. Finally, the gravity vector provided by the acceleration measured in the optimized inertial pose information is used to establish the link between the current coordinate system and the world coordinate system. Based on this association between the current coordinate system and the world coordinate system, the estimated position can be converted into a position in the world coordinate system.
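Among the calibration parameter values mentioned above is the time delay between the camera and the inertial sensing module. Purely as an assumed illustration (not stated to be the method adopted by the application), one simple way to estimate such a delay is to cross-correlate the rotation-rate magnitude measured by the gyroscope with the rotation-rate magnitude recovered from the visual pose sequence:

```python
import numpy as np

def estimate_time_delay(t, gyro_rate_norm, visual_rot_rate_norm, max_shift=50):
    """Estimate the camera-IMU time delay by finding the shift (in samples)
    that maximizes the correlation between the two rotation-rate magnitudes.
    Both inputs are assumed resampled onto the same uniform time grid t."""
    best_shift, best_score = 0, -np.inf
    g = gyro_rate_norm - gyro_rate_norm.mean()
    v = visual_rot_rate_norm - visual_rot_rate_norm.mean()
    for s in range(-max_shift, max_shift + 1):
        # np.roll wraps around, which is acceptable for shifts that are
        # short relative to the sequence length
        score = float(np.dot(g, np.roll(v, s)))
        if score > best_score:
            best_shift, best_score = s, score
    dt = t[1] - t[0]            # uniform sampling interval
    return best_shift * dt      # positive value: visual stream lags the IMU
```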
It should be emphasized that, in terms of hardware implementation, when the mobile terminal is represented as a mobile terminal such as a smart phone, a smart watch or a portable computer, the inertial sensing module may also be directly embedded into the monocular photography module circuit board, so as to provide a low-cost and high-performance space sensing scheme.
It is also emphasized that the above only proposes implementation possibilities of a spatial awareness scheme on a hardware implementation. It is also desirable to reduce the computational complexity of the spatial awareness technique on the basis of limited computational power at the mobile end. The implementation scheme for reducing the computational complexity of the spatial awareness technology is detailed in the following steps.
S170: and determining key frames in the video stream according to the key frame strategy.
It will be appreciated that the present application determines key frames in a video stream according to a key frame policy to reduce redundant image frames and thus reduce computational complexity.
In one embodiment of the present application, the determining a key frame in a video stream according to a key frame policy specifically includes:
tracking image features of adjacent frames in the video stream by adopting a sparse optical flow algorithm;
when the number of tracked image feature points between adjacent frames in the video stream is smaller than a first preset threshold, taking the later image frame of the adjacent frames in the time sequence as a first key frame;
calculating the average parallax between image frames later in the time sequence than the first key frame and the first key frame;
and when the average parallax is larger than a second preset threshold value, determining the corresponding image frame in the time sequence as a second key frame.
When the number of tracked image feature points between adjacent frames in the video stream is smaller than the first preset threshold, taking the later image frame of the adjacent frames as the first key frame prevents the feature track from being lost completely. When the average parallax is larger than the second preset threshold and the corresponding image frame is determined to be a second key frame, the loss of image feature tracking caused by triangulation degeneration when the monocular photography module undergoes nearly pure rotational motion is avoided.
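Read as pseudocode, the two-part key frame strategy above can be sketched as follows; the threshold values are illustrative placeholders for the first and second preset thresholds, and the parallax is taken as the mean displacement of tracked features between the candidate frame and the current key frame.

```python
import numpy as np

def is_new_keyframe(tracked_count, kf_pts, cur_pts,
                    min_tracked=50, min_parallax=10.0):
    """tracked_count: number of features still tracked between adjacent frames.
    kf_pts / cur_pts: (N, 2) positions of the same features in the last key
    frame and in the current frame. The thresholds are illustrative stand-ins
    for the first and second preset thresholds in the text."""
    if tracked_count < min_tracked:
        return True                       # first rule: too few tracked features
    avg_parallax = float(np.mean(np.linalg.norm(cur_pts - kf_pts, axis=1)))
    return avg_parallax > min_parallax    # second rule: enough average parallax
```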
S180: and configuring an marginalized sliding window according to the calibration parameter values and the key frames so as to reduce the computational complexity.
It will be appreciated that, to further limit the computational complexity, the inventors introduced a marginalized sliding window to selectively marginalize the optimized inertial pose information and the estimated depth within the sliding window, while converting the marginalized optimized inertial pose information and estimated depth into a prior. In order to maintain the sparsity of the estimation, the marginalized sliding window does not marginalize the optimized inertial pose information corresponding to non-key frames.
Further, in a specific embodiment provided by the present application, the marginalized sliding window is configured to:
recording visual pose information corresponding to a first number of key frames;
acquiring the image frame following the key frame in the time sequence;
when the subsequent image frame is a new key frame, recording visual pose information corresponding to the new key frame, and marginalizing the earliest key frame in the time sequence together with the visual pose information corresponding to that key frame;
when the subsequent image frame is a non-key frame, recording the optimized inertial pose information corresponding to that non-key frame.
In this way, the amount of computation is reduced while enough image features are retained for triangulation, and the accuracy of the acceleration estimate is preserved to the greatest extent.
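The bookkeeping of the marginalized sliding window can be illustrated with the simplified sketch below. In a full system the removed states would be folded into a prior via a Schur complement; here that step is reduced to a placeholder so the example stays self-contained, and the window size of 10 key frames is an assumption.

```python
from collections import deque

class MarginalizationWindow:
    """Keeps visual pose states for the most recent key frames and IMU-only
    states for non-key frames; a simplified illustration of the description above."""
    def __init__(self, max_keyframes=10):
        self.keyframe_states = deque()   # (frame_id, visual_pose) per key frame
        self.nonkeyframe_imu = []        # optimized inertial pose for non-key frames
        self.prior = None                # stands in for the marginalization prior
        self.max_keyframes = max_keyframes

    def add_frame(self, frame_id, is_keyframe, visual_pose=None, imu_pose=None):
        if is_keyframe:
            self.keyframe_states.append((frame_id, visual_pose))
            if len(self.keyframe_states) > self.max_keyframes:
                oldest = self.keyframe_states.popleft()
                # a full system would fold the dropped state into a
                # Schur-complement prior rather than just noting its id
                self.prior = ("marginalized", oldest[0])
        else:
            self.nonkeyframe_imu.append((frame_id, imu_pose))
```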
S190: based on the marginalized sliding window, an accumulated integration residual between adjacent frame images is calculated.
S200: based on the accumulated integral residual error, the initial visual pose information and the optimized inertial pose information are subjected to close-coupled nonlinear optimization, and estimated pose information is generated.
It will be appreciated that while marginalized sliding windows limit computational complexity, they also introduce cumulative drift. To eliminate the accumulated drift, the accumulated integration residual between adjacent frame images needs to be calculated based on an marginalized sliding window.
Then, based on the accumulated integration residual, tightly coupled nonlinear optimization is performed on the initial visual pose information and the optimized inertial pose information, finally generating the estimated pose information.
Tightly coupled nonlinear optimization is the process of placing the optimized inertial pose information and the initial visual pose information into one filter for joint optimization, adding the image features to the feature vector, jointly constructing the motion equation and the observation equation, and then performing state estimation to finally obtain the pose information.
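As an illustration, such a tightly coupled optimization can be sketched as a stacked nonlinear least-squares problem in which the visual reprojection residuals and the accumulated IMU integration residuals share one state vector. The residual functions below are placeholders standing in for the full measurement models, and scipy is used only to show how the joint solve is assembled:

```python
import numpy as np
from scipy.optimize import least_squares

def joint_residual(x, reproj_residual_fn, imu_residual_fn):
    """x: stacked state vector (poses, velocities, biases, feature depths).
    Both residual functions are placeholders: each maps the shared state to
    its own residual vector, which is what makes the coupling 'tight'."""
    return np.concatenate([reproj_residual_fn(x), imu_residual_fn(x)])

def tightly_coupled_optimize(x0, reproj_residual_fn, imu_residual_fn):
    # a robust loss is commonly applied to the visual terms; here it is
    # applied to all terms for brevity
    result = least_squares(joint_residual, x0, loss="huber",
                           args=(reproj_residual_fn, imu_residual_fn))
    return result.x   # estimated pose information (and the other states)
```

Stacking both residual types over one shared state vector is what distinguishes this tightly coupled formulation from the loose coupling used during initialization.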
S210: and configuring a local sliding window according to the calibration parameter value, the feature descriptor extraction algorithm and the local closed-loop constraint.
S220: and carrying out local loop detection on the estimated pose information based on the local sliding window, and generating local estimated pose information.
It will be appreciated that to further limit the computational complexity, the inventors introduced a local sliding window to further screen and save key frames. Specifically, a local sliding window is configured according to the calibration parameter value, the feature descriptor extraction algorithm and the local closed-loop constraint. The feature descriptor extraction algorithm is expressed as a BRIEF descriptor in a specific application scene and is used for calculating the similarity between a key frame and an adjacent frame. The partial sliding window is configured to:
Recording key frames meeting local closed loop constraint;
calculating the similarity between the key frame and its adjacent frame by adopting a feature descriptor extraction algorithm;
and deleting the key frames with the similarity with the adjacent frames exceeding a third preset threshold value.
Then, based on the local sliding window, a local loop detection constraint is added to the estimated pose information and nonlinear optimization is performed, generating more accurate local estimated pose information.
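As an assumed illustration of the similarity test used by the local sliding window, the sketch below uses OpenCV's ORB descriptor (an oriented variant of BRIEF) with Hamming-distance matching; the match-distance limit and the 0.8 similarity threshold stand in for the third preset threshold and are not values given by the application.

```python
import cv2

def keyframe_similarity(img_a, img_b, max_hamming=50):
    """Return a rough similarity in [0, 1] between two grayscale key frames,
    based on BRIEF-style binary descriptors (ORB) and Hamming matching."""
    orb = cv2.ORB_create(nfeatures=500)
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0.0
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    good = [m for m in matcher.match(des_a, des_b) if m.distance < max_hamming]
    return len(good) / max(len(kp_a), len(kp_b), 1)

def prune_redundant(keyframes, threshold=0.8):
    """Drop key frames whose similarity to their neighbor exceeds the threshold,
    mirroring the deletion rule of the local sliding window."""
    if not keyframes:
        return []
    kept = [keyframes[0]]
    for frame in keyframes[1:]:
        if keyframe_similarity(kept[-1], frame) <= threshold:
            kept.append(frame)
    return kept
```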
S230: based on a positioning module of the mobile terminal, global track information corresponding to the time sequence is acquired.
It can be appreciated that the positioning module is represented as a GNSS sensor in a specific application scenario, and may directly measure track information of the mobile terminal when the mobile terminal moves in the world coordinate system. The track information is composed of a plurality of coordinate points. Typically, the coordinate points have time stamps. And arranging a plurality of coordinate points in a time sequence according to the time stamp corresponding to the coordinate point, namely forming the track information.
S240: and establishing an association relation between the global track information and the local estimated pose information according to the time sequence, and generating a sparse point cloud with position attribute.
It should be noted that the local estimated pose information can provide accurate pose information within a small area, but since the world coordinate system is not involved, the local estimated pose information takes the first frame as the current coordinate system, so the estimated pose is relative to the first frame image rather than to the world coordinate system. Thus, even if the mobile terminal performs spatial perception in the same environment, different local estimated pose information may be obtained when the mobile terminal starts from different points.
Second, due to the lack of global measurements, the local estimated pose information is prone to accumulated drift over long runs. Although local loop detection was introduced above to eliminate drift, it cannot cope with large-scale environments containing massive amounts of data.
In a preferred embodiment of the present application, the mobile terminal further establishes an association relationship between the global track information and the local estimated pose information according to the time sequence. Nonlinear optimization is then performed on the fused global track information and local estimated pose information using sparse matrix Cholesky decomposition, generating a sparse point cloud with a position attribute.
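The application performs this fusion as a sparse nonlinear optimization solved with Cholesky decomposition. The sketch below shows only the simpler, closed-form first step such a fusion might start from, namely a similarity (Umeyama) alignment of the locally estimated trajectory to the timestamp-matched GNSS positions; it is an assumed illustration, not the optimization actually claimed.

```python
import numpy as np

def align_local_to_global(local_xyz, gnss_xyz):
    """Umeyama-style similarity alignment: find scale s, rotation R and
    translation t so that s * R @ local + t best fits the GNSS positions.
    local_xyz, gnss_xyz: (N, 3) points matched by timestamp."""
    mu_l, mu_g = local_xyz.mean(axis=0), gnss_xyz.mean(axis=0)
    L, G = local_xyz - mu_l, gnss_xyz - mu_g
    U, S, Vt = np.linalg.svd(G.T @ L / len(local_xyz))   # cross-covariance SVD
    D = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:       # keep a proper rotation (det = +1)
        D[2, 2] = -1
    R = U @ D @ Vt
    var_l = (L ** 2).sum() / len(local_xyz)
    s = np.trace(np.diag(S) @ D) / var_l
    t = mu_g - s * R @ mu_l
    return s, R, t                       # apply as: s * (R @ p) + t
```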
S250: and inputting the video stream to a monocular depth estimation model to obtain image depth information.
S260: and updating the sparse point cloud into a dense point cloud according to the image depth information, and taking the dense point cloud as space data.
It can be appreciated that, in order to adapt to the performance limitations of the mobile terminal device, the application adopts a sparse optical flow algorithm for feature tracking. Therefore, even after the global track information and the local estimated pose information are aligned with the world coordinate system, the optimization result is still a sparse point cloud. Furthermore, the sparse point cloud still does not have actual physical dimensions. For this purpose, the video stream also needs to be input to the monocular depth estimation model to obtain image depth information.
The monocular depth estimation model is a pre-trained convolutional neural network that learns to generate disparity maps by exploiting epipolar geometric constraints under an image reconstruction loss. Its training principle can be expressed simply as follows:
First, the real left image of a binocular camera is taken as input, and the convolutional neural network outputs two disparity maps corresponding respectively to the left and right images of the binocular camera. The real right image is then taken as input and combined with the predicted disparity map to generate an estimated left image. The estimated left image is compared with the real left image, and the network is trained by back-propagating the reconstruction loss. No depth data is needed during training; depth is obtained as an intermediate value.
The trained monocular depth estimation model may be used to predict depth information from a single picture.
After the image depth information is obtained through the monocular depth estimation model, the sparse point cloud can be updated into a dense point cloud according to the image depth information and used as space data. And finally realizing the space perception capability at the mobile terminal.
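As a final illustration, once a per-pixel depth map has been predicted (the depth network is treated as a given here; any pre-trained monocular depth estimation model could stand in), the densification step can be sketched as back-projecting every pixel through an assumed pinhole camera model so that the resulting points inherit the position attribute:

```python
import numpy as np

def densify_from_depth(depth, K, pose_R=np.eye(3), pose_t=np.zeros(3)):
    """Back-project a per-pixel depth map into a dense 3D point cloud.
    depth: (H, W) depths predicted by the monocular depth estimation model.
    K: 3x3 pinhole intrinsics (assumed known); pose_R, pose_t place the
    camera in the world frame so the cloud carries the position attribute."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) / fx * depth           # pixel -> camera coordinates
    y = (v - cy) / fy * depth
    pts_cam = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return pts_cam @ pose_R.T + pose_t  # camera -> world coordinates
```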
In summary, the present application provides a mobile terminal space data generation method which, by determining key frames in the video stream and setting a marginalized sliding window and a local sliding window, greatly reduces redundant image frames and thereby the computational complexity. As a result, a mobile terminal equipped with a monocular photography module, an inertial sensing module and a positioning module can also achieve spatial perception capability, which reduces the implementation cost and unlocks the potential of converting spatial perception technology into productivity.
Referring to fig. 2, in order to support a method for generating spatial data of a mobile terminal, the present application further provides a spatial aware mobile terminal 100, including:
the inertial sensing module 11 is used for acquiring initial inertial pose information corresponding to the time sequence;
a monocular photography module 12 for capturing video streams corresponding to the time series;
the positioning module 13 is used for acquiring global track information corresponding to the time sequence;
the space data generating module 14 is used for pre-integrating the initial inertial pose to obtain optimized inertial pose information; it is also used for determining the image features of adjacent frames in the video stream by adopting a sparse optical flow algorithm; it is also used for generating initial visual pose information by adopting a motion structure recovery algorithm according to the image features of adjacent frames in the video stream; it is also used for loosely coupling the initial visual pose information and the optimized inertial pose information to obtain calibration parameter values; it is also used for determining key frames in the video stream according to a key frame strategy; it is also used for configuring a marginalized sliding window according to the calibration parameter value and the key frame so as to reduce the calculation complexity; it is also used for calculating accumulated integral residual errors between adjacent frame images based on the marginalized sliding window; it is also used for carrying out tightly coupled nonlinear optimization on the initial visual pose information and the optimized inertial pose information based on the accumulated integral residual error to generate estimated pose information; it is also used for configuring a local sliding window according to the calibration parameter value, the feature descriptor extraction algorithm and the local closed-loop constraint; it is also used for carrying out local loop detection on the estimated pose information based on the local sliding window to generate local estimated pose information; it is also used for establishing an association relation between global track information and local estimated pose information according to the time sequence, and generating a sparse point cloud with position attribute; it is also used for inputting a video stream to the monocular depth estimation model to obtain image depth information; and it is also used for updating the sparse point cloud into a dense point cloud according to the image depth information, and taking the dense point cloud as space data.
It may be appreciated that the mobile terminal 100 may be embodied as at least one mobile terminal of a smart phone, a smart watch, or a portable computer in a specific application scenario.
The inventors found that, as the production technology of mobile terminals matures, consumer-level hardware devices are usually provided with a monocular photography module 11, an inertial sensing module 12 and a positioning module 13, so the measurement data required for spatial perception can be obtained. In other words, if spatial perception capability is given to consumer-level hardware devices, the implementation cost of spatial perception can be reduced, unlocking the potential of converting spatial perception technology into productivity.
Specifically, the inertial sensing module 11 collects initial inertial pose information corresponding to the time series. The spatial data generation module 14 pre-integrates the initial inertial pose to obtain optimized inertial pose information.
It is to be understood that the inertial sensor module 11 may be embodied as a gyroscope, an acceleration sensor, etc. in a specific application scenario. Generally, the inertial sensing module 11 can directly measure the angular velocity and acceleration of the mobile terminal 100 during movement as initial inertial pose information, and the response speed is high. Of course, in a preferred embodiment of the present application, the inertial sensing module 11 records the angular velocity and the acceleration of the mobile terminal 100 during movement according to the time stamp, and arranges the angular velocity and the acceleration in time series to form the initial inertial pose information.
It should be noted that, since the inertial sensing module 11 of the mobile terminal 100 of the present application is difficult to guarantee high accuracy, its measurement of the initial inertial pose information is easily affected by bias and noise, which causes the estimated pose to diverge.
For this purpose, the spatial data generation module 14 is also required to pre-integrate the initial inertial pose to obtain optimized inertial pose information.
It should be further noted that, whether the inertial pose information is initial or optimized, the coordinate system corresponding to the first timestamp is taken as the current coordinate system, and the pose thus estimated is the pose relative to the coordinate system corresponding to the first timestamp, and not the pose relative to the world coordinate system.
It should also be emphasized that, even after the spatial data generation module 14 pre-integrates the initial inertial pose, the optimized inertial pose information still exhibits significant drift, and the drift accumulates as the time sequence grows. Using only the optimized inertial pose information as the basis for spatial perception is therefore highly unreliable. For example, even if the mobile terminal 100 is held in a fixed position, the optimized inertial pose information obtained by the spatial data generation module 14 after pre-integrating the initial inertial pose information still drifts.
Therefore, the inventor introduces visual pose information which does not drift to align with the optimized inertial pose information so as to overcome the defect that the optimized inertial pose information has drift.
The monocular photography module 12 captures a video stream corresponding to the time series. The spatial data generation module 14 uses a sparse optical flow algorithm to determine image features of adjacent frames in the video stream. The spatial data generation module 14 generates initial visual pose information by using a motion structure recovery algorithm according to image features of adjacent frames in the video stream.
It will be appreciated that the monocular photography module 12 appears as a monocular lens in a particular application scenario. The video stream is made up of a number of image frames. Typically, the image frames have a time stamp. And arranging a plurality of image frames in a time sequence according to the time stamp corresponding to the image frame, namely forming the video stream.
Optical flow is the instantaneous velocity of the pixel motion of a spatially moving object on the viewing imaging plane. The sparse optical flow algorithm is an algorithm for finding out the corresponding relation between the previous frame and the current frame by utilizing the change of pixels in an image sequence in a time domain and the correlation between adjacent frames, so as to calculate the motion information of an object between the adjacent frames.
The spatial data generation module 14 adopts a sparse optical flow algorithm to determine image features of adjacent frames in the video stream, and specifically includes:
image features of each image frame are tracked by adopting a sparse optical flow algorithm, so as to maintain a minimum of 100 to 300 corner features in each image frame. At the same time, a minimum pixel spacing between two adjacent features is set to enforce a uniform feature distribution.
The motion structure restoration algorithm is represented as a Structure From Motion (SFM) algorithm in a specific application scene and is used for extracting visual pose information from image features of adjacent frames in a video stream.
In a specific application scenario provided by the present application, the implementation process of generating the initial visual pose information by the spatial data generation module 14 using a motion structure recovery algorithm according to the image features of the adjacent frames in the video stream is shown as follows:
epipolar constraints and triangulation are applied to the image features of adjacent frames in the video stream to recover the three-dimensional information of the feature objects.
It is worth noting that the visual pose information does not substantially drift.
However, when the image content changes, it is essentially unknown whether the mobile terminal 100 itself has moved or the external conditions have changed, so dynamic obstacles are difficult to handle with purely visual pose information. Meanwhile, since the accuracy of triangulation based on visual feature points is related to the inter-frame displacement, triangulation may degenerate when the mobile terminal 100 undergoes nearly pure rotational motion, which may lead to loss of image feature tracking. Moreover, the motion structure recovery algorithm typically takes the first frame as the current coordinate system, so the estimated pose is relative to the first frame image rather than to the world coordinate system.
In summary, the depth in the initial visual pose information is not at the actual physical scale, so the feature objects in the initial visual pose information have the correct shape but not the actual size.
The spatial data generation module 14 loosely couples the initial visual pose information and the optimized inertial pose information to obtain calibration parameter values.
It should be pointed out again that the initial visual pose information is not subject to drift but is affected by dynamic obstacles, and the optimized inertial pose information is not affected by dynamic obstacles but is subject to accumulated drift. Intuitively, the two have good complementarity. Meanwhile, the coordinate system corresponding to the initial visual pose information and the coordinate system corresponding to the optimized inertial pose information are not world coordinate systems.
For this purpose, the spatial data generating module 14 performs loose coupling on the initial visual pose information and the optimized inertial pose information to obtain calibration parameter values and complete the initialization. Loose coupling fuses the initial visual pose information and the optimized inertial pose information directly; the fusion process does not modify either input, and the two are typically fused by an EKF and output as a post-processing step. Specifically, the spatial data generating module 14 obtains initial visual pose information and optimized inertial pose information under the same time stamp with a calibration board to align the pose sequences, and calculates calibration parameter values such as the rotation, displacement and time delay between the coordinate systems. The optimized inertial pose information can also predict the position and attitude of the next image frame, as well as where the image features of the previous moment will appear in the next frame, which improves the matching speed of the feature tracking algorithm and its robustness to fast rotation. Finally, the gravity vector provided by the acceleration measured in the optimized inertial pose information is used to establish the link between the current coordinate system and the world coordinate system. Based on this association between the current coordinate system and the world coordinate system, the spatial data generation module 14 can convert the estimated position into a position in the world coordinate system.
It should be emphasized that, in terms of hardware implementation, when the mobile terminal 100 is represented as a mobile terminal such as a smart phone, a smart watch or a portable computer, the inertial sensor module 11 may also be directly embedded in the circuit board of the monocular photography module 12, so as to provide a low-cost and high-performance space sensing scheme.
It is also emphasized that the above only proposes implementation possibilities of a spatial awareness scheme on a hardware implementation. It is also desirable to reduce the computational complexity of the spatial awareness technique based on the limited computational power of the mobile terminal 100.
The spatial data generation module 14 determines key frames in the video stream according to a key frame policy to reduce redundant image frames and thereby reduce computational complexity.
In one embodiment of the present application, the spatial data generation module 14 determines a key frame in a video stream according to a key frame policy, and specifically includes:
tracking image features of adjacent frames in the video stream by adopting a sparse optical flow algorithm;
when the number of tracked image feature points between adjacent frames in the video stream is smaller than a first preset threshold, taking the later image frame of the adjacent frames in the time sequence as a first key frame;
calculating the average parallax between image frames later in the time sequence than the first key frame and the first key frame;
and when the average parallax is larger than a second preset threshold value, determining the corresponding image frame in the time sequence as a second key frame.
When the number of tracked image feature points between adjacent frames in the video stream is smaller than the first preset threshold, taking the later image frame of the adjacent frames as the first key frame prevents the feature track from being lost completely. When the average parallax is larger than the second preset threshold and the corresponding image frame is determined to be a second key frame, the loss of image feature tracking caused by triangulation degeneration when the monocular photography module 12 undergoes nearly pure rotational motion is avoided.
The spatial data generation module 14 configures a marginalized sliding window according to the calibration parameter values and the key frames to reduce the computational complexity.
It will be appreciated that, to further limit the computational complexity, the inventors introduced a marginalized sliding window to selectively marginalize the optimized inertial pose information and the estimated depth within the sliding window, while converting the marginalized optimized inertial pose information and estimated depth into a prior. In order to maintain the sparsity of the estimation, the marginalized sliding window does not marginalize the optimized inertial pose information corresponding to non-key frames.
Further, in a specific embodiment provided by the present application, the marginalized sliding window is configured to:
recording visual pose information corresponding to a first number of key frames;
acquiring the image frame following the key frame in the time sequence;
when the subsequent image frame is a new key frame, recording visual pose information corresponding to the new key frame, and marginalizing the earliest key frame in the time sequence together with the visual pose information corresponding to that key frame;
when the subsequent image frame is a non-key frame, recording the optimized inertial pose information corresponding to that non-key frame.
In this way, the amount of computation is reduced while enough image features are retained for triangulation, and the accuracy of the acceleration estimate is preserved to the greatest extent.
The spatial data generation module 14 calculates an accumulated integration residual between adjacent frame images based on the marginalized sliding window. The spatial data generation module 14 then performs tightly coupled nonlinear optimization on the initial visual pose information and the optimized inertial pose information based on the accumulated integral residuals to generate estimated pose information.
It will be appreciated that while marginalized sliding windows limit computational complexity, they also introduce cumulative drift. To eliminate the accumulated drift, the spatial data generation module 14 is required to calculate the accumulated integration residual between adjacent frame images based on the marginalized sliding window.
The spatial data generation module 14 then performs tightly coupled nonlinear optimization on the initial visual pose information and the optimized inertial pose information based on the accumulated integral residual, and finally generates the estimated pose information.
Tightly coupled nonlinear optimization is the process of placing the optimized inertial pose information and the initial visual pose information into one filter for joint optimization, adding the image features to the feature vector, jointly constructing the motion equation and the observation equation, and then performing state estimation to finally obtain the pose information.
The spatial data generation module 14 configures a local sliding window according to the calibration parameter values, the feature descriptor extraction algorithm, and the local closed loop constraints. The spatial data generation module 14 performs local loop detection on the estimated pose information based on the local sliding window, and generates local estimated pose information.
It will be appreciated that to further limit the computational complexity, the inventors introduced a local sliding window to further screen and save key frames. Specifically, the spatial data generation module 14 configures a local sliding window according to the calibration parameter values, the feature descriptor extraction algorithm, and the local closed loop constraint. The feature descriptor extraction algorithm is expressed as a BRIEF descriptor in a specific application scene and is used for calculating the similarity between a key frame and an adjacent frame. The partial sliding window is configured to:
recording key frames that satisfy the local closed-loop constraint;
calculating the similarity between a key frame and its adjacent frame using the feature descriptor extraction algorithm;
and deleting key frames whose similarity to their adjacent frames exceeds a third preset threshold.
The spatial data generation module 14 then adds the local loop detection constraint to the estimated pose information based on the local sliding window, performs nonlinear optimization, and generates more accurate local estimated pose information.
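A possible realization of the similarity-based key frame culling is sketched below. The patent names BRIEF descriptors; the sketch substitutes OpenCV's ORB (whose descriptors are BRIEF-derived and ship with mainline OpenCV), and the matching threshold and similarity measure are assumptions for illustration.

```python
import cv2


def binary_descriptor_similarity(img_a, img_b, max_hamming=40):
    """Similarity score between two frames from binary (BRIEF-style) descriptors."""
    orb = cv2.ORB_create()
    _, des_a = orb.detectAndCompute(img_a, None)
    _, des_b = orb.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0.0
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_a, des_b)
    good = [m for m in matches if m.distance < max_hamming]
    # Fraction of descriptors with a close match, used as the similarity score.
    return len(good) / max(len(des_a), 1)


def cull_redundant_keyframes(keyframes, adjacent_frames, third_threshold=0.8):
    """Delete key frames whose similarity to their adjacent frame exceeds the
    (illustrative) third preset threshold."""
    kept = []
    for kf, adj in zip(keyframes, adjacent_frames):
        if binary_descriptor_similarity(kf, adj) <= third_threshold:
            kept.append(kf)
    return kept
```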
The positioning module 13 collects global track information corresponding to the time sequence.
The spatial data generation module 14 establishes an association relationship between global track information and local estimated pose information according to the time sequence, and generates sparse point clouds with position attributes.
It can be appreciated that the positioning module 13 is implemented as a GNSS sensor in a specific application scenario and can directly measure the track of the mobile terminal 100 as it moves in the world coordinate system. The track information consists of a plurality of coordinate points, each of which typically carries a timestamp. Arranging the coordinate points in time order according to their timestamps yields the track information.
It should be noted that the local estimated pose information generated by the spatial data generation module 14 provides accurate pose information within a small area, but because the world coordinate system is not involved, the local estimated pose information uses the first frame as its reference coordinate system; the estimated pose is therefore relative to the first image frame rather than to the world coordinate system. Consequently, even if the mobile terminal 100 performs spatial perception in the same environment, different local estimated pose information may be obtained when the mobile terminal 100 starts from different points.
Second, due to the lack of global measurements, the local estimated pose information is prone to cumulative drift over long runs. Although local loop detection is introduced above to suppress drift, it cannot by itself handle large-scale environments with massive amounts of data.
In a preferred embodiment of the present application, the spatial data generation module 14 further establishes the association between the global track information and the local estimated pose information according to the time sequence. The spatial data generation module 14 then uses a sparse-matrix Cholesky decomposition to perform nonlinear optimization on the fused global track and local estimated pose data, and generates a sparse point cloud with position attributes.
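The fusion step can be pictured as a small position-only pose-graph problem: absolute factors pull each frame toward its timestamp-associated GNSS coordinate, while relative factors preserve the shape of the locally estimated trajectory. In the sketch below the weights, the position-only state, and the use of spsolve in place of an explicit sparse Cholesky factorization are assumptions; a production system would factorize the sparse normal matrix directly.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve


def fuse_global_and_local(local_xyz, gnss_xyz, w_abs=1.0, w_rel=10.0):
    """One linearized fusion step over timestamp-associated positions.

    local_xyz, gnss_xyz: (N, 3) arrays of local estimated and GNSS positions
    already associated frame-by-frame through their timestamps.
    """
    n = local_xyz.shape[0]
    x = local_xyz.reshape(-1).copy()                 # initial state: local estimate

    # Relative (odometry) factors: x_{i+1} - x_i should match the local displacement.
    D = sparse.kron(
        sparse.diags([-np.ones(n - 1), np.ones(n - 1)], offsets=[0, 1],
                     shape=(n - 1, n)),
        sparse.identity(3), format="csr")
    rel = (local_xyz[1:] - local_xyz[:-1]).reshape(-1)
    r_rel = (D @ x - rel) * w_rel

    # Absolute (GNSS) factors: x_i should match the measured world coordinate.
    J_abs = sparse.identity(3 * n, format="csr")
    r_abs = (x - gnss_xyz.reshape(-1)) * w_abs

    J = sparse.vstack([J_abs * w_abs, D * w_rel], format="csr")
    r = np.concatenate([r_abs, r_rel])

    H = (J.T @ J).tocsc()                 # sparse, symmetric positive-definite
    dx = spsolve(H, -(J.T @ r))           # stand-in for a sparse Cholesky solve
    return (x + dx).reshape(n, 3)         # fused positions in world coordinates
```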
The spatial data generation module 14 inputs the video stream to the monocular depth estimation model to obtain image depth information. The spatial data generation module 14 updates the sparse point cloud to a dense point cloud as spatial data according to the image depth information.
It will be appreciated that, in order to accommodate the performance limitations of the mobile terminal 100, the present application employs a sparse optical flow algorithm for feature tracking. Therefore, even after the global track information and the local estimated pose information are aligned with the world coordinate system, the optimization result is still a sparse point cloud. Moreover, the sparse point cloud does not yet reflect actual physical dimensions. For this reason, the spatial data generation module 14 also inputs the video stream to the monocular depth estimation model to obtain image depth information.
The monocular depth estimation model is a pre-trained convolutional neural network that uses epipolar geometric constraints and an image reconstruction loss to train the network to produce disparity (parallax) maps. Its training principle can be summarized as follows:
First, the real left image of a binocular camera is taken as input, and a convolutional neural network outputs two disparity maps corresponding to the left and right images of the binocular camera. The real right image is then taken as input and combined with the predicted disparity map to generate an estimated left image. The estimated left image is compared with the real left image, and the reconstruction loss is back-propagated to train the network further. No depth data is needed during training; depth only appears as an intermediate value.
The trained monocular depth estimation model can then be used to predict depth information from a single image.
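A much-simplified PyTorch sketch of this self-supervised training signal is shown below: the real right image is warped with the predicted left disparity to reconstruct the left view, and an L1 photometric loss supervises the network. The single-disparity output, the disparity normalization, and the bare L1 loss are simplifications assumed for the sketch; published stereo self-supervision methods usually predict both disparities and add smoothness and left-right consistency terms.

```python
import torch
import torch.nn.functional as F


def reconstruct_left(right_img, disp_left):
    """Warp the real right image with the predicted left disparity to synthesize
    the left view. Disparity is assumed normalized by image width."""
    b, _, h, w = right_img.shape
    device = right_img.device
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=device),
                            torch.linspace(-1, 1, w, device=device),
                            indexing="ij")
    # Shift the horizontal sampling positions by the predicted disparity.
    gx = xs.unsqueeze(0) - 2.0 * disp_left.squeeze(1)
    gy = ys.unsqueeze(0).expand_as(gx)
    grid = torch.stack((gx, gy), dim=-1)              # (B, H, W, 2) sampling grid
    return F.grid_sample(right_img, grid, align_corners=True)


def photometric_loss(model, left_img, right_img):
    # The network predicts a left disparity map from the left image alone;
    # depth is never a label and appears only as an implicit intermediate value.
    disp_left = model(left_img)                       # assumed shape (B, 1, H, W)
    left_rec = reconstruct_left(right_img, disp_left)
    return torch.mean(torch.abs(left_rec - left_img))
```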
After the spatial data generation module 14 obtains the image depth information through the monocular depth estimation model, it updates the sparse point cloud into a dense point cloud according to the image depth information and uses it as the spatial data, ultimately giving the mobile terminal 100 spatial perception capability.
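Densification itself amounts to back-projecting each predicted depth value through the camera intrinsics and transforming it with the estimated camera pose. The minimal sketch below assumes a standard pinhole model and a 4x4 world-from-camera pose matrix; in practice the predicted depths would typically also be reconciled with the sparse points (for example to fix the metric scale) before the dense cloud is adopted as the spatial data.

```python
import numpy as np


def densify_point_cloud(depth_map, K, T_world_from_cam):
    """Back-project a per-pixel depth map into a dense 3D point cloud.

    depth_map: (H, W) predicted depths, K: 3x3 intrinsics,
    T_world_from_cam: 4x4 homogeneous camera pose.
    """
    h, w = depth_map.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    us, vs = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth_map
    x = (us - cx) * z / fx
    y = (vs - cy) * z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1).reshape(-1, 4)
    # Transform camera-frame points into the world frame with the estimated pose.
    pts_world = (T_world_from_cam @ pts_cam.T).T[:, :3]
    return pts_world
```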
In summary, the present application provides a space-aware mobile terminal 100 in which the spatial data generation module 14 determines key frames in the video stream and configures a marginalized sliding window and a local sliding window, greatly reducing redundant image frames and thereby the computational complexity. A mobile terminal so equipped can achieve spatial perception capability, which reduces implementation cost and further unlocks the potential of turning spatial perception technology into productivity.
It should be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the statement "comprises a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims (10)

1. A mobile terminal space data generation method, characterized in that it is applied to a mobile terminal having an inertial sensing module, a monocular photography module and a positioning module, the method comprising the following steps:
based on an inertial sensing module of the mobile terminal, acquiring initial inertial pose information corresponding to a time sequence;
pre-integrating the initial inertial pose to obtain optimized inertial pose information;
based on a monocular photography module of a mobile terminal, acquiring a video stream corresponding to a time sequence;
adopting a sparse optical flow algorithm to determine the image characteristics of adjacent frames in the video stream;
generating initial visual pose information by adopting a motion structure recovery algorithm according to image characteristics of adjacent frames in a video stream;
loosely coupling the initial visual pose information and the optimized inertial pose information to obtain a calibration parameter value;
determining key frames in the video stream according to a key frame strategy;
configuring a marginalized sliding window according to the calibration parameter values and the key frames, so as to reduce the computational complexity;
based on the marginalized sliding window, calculating accumulated integration residuals between adjacent frame images;
based on the accumulated integration residuals, performing tightly coupled nonlinear optimization on the initial visual pose information and the optimized inertial pose information to generate estimated pose information;
Configuring a local sliding window according to the calibration parameter value, the feature descriptor extraction algorithm and the local closed-loop constraint;
based on the local sliding window, carrying out local loop detection on the estimated pose information to generate local estimated pose information;
based on a positioning module of the mobile terminal, acquiring global track information corresponding to the time sequence;
establishing an association relation between global track information and local estimated pose information according to the time sequence, and generating sparse point cloud with position attribute;
inputting a video stream to a monocular depth estimation model to obtain image depth information;
and updating the sparse point cloud into a dense point cloud according to the image depth information, and taking the dense point cloud as space data.
2. The method for generating spatial data at a mobile terminal according to claim 1, wherein determining key frames in a video stream according to a key frame policy specifically comprises:
tracking image features of adjacent frames in the video stream by adopting a sparse optical flow algorithm;
when the number of image feature points of the adjacent frames in the video stream is smaller than a first preset threshold, taking the later of the adjacent frames in the time sequence as a first key frame;
calculating the average parallax between the image frames following the first key frame in the time sequence and the first key frame;
and when the average parallax is greater than a second preset threshold, determining the corresponding image frame in the time sequence as a second key frame.
3. The mobile-side spatial data generation method of claim 1, wherein the marginalized sliding window is configured to:
recording visual pose information corresponding to a first number of key frames;
acquiring the image frame that follows the key frames in the time sequence;
when the subsequent image frame is a new key frame, recording the visual pose information corresponding to the new key frame, and marginalizing the earliest key frame in the time sequence together with the visual pose information corresponding to that key frame;
when the subsequent image frame is a non-key frame, recording the optimized inertial pose information corresponding to the non-key frame's position in the time sequence.
4. The mobile-side spatial data generation method of claim 1, wherein the local sliding window is configured to:
recording key frames meeting local closed loop constraint;
calculating the similarity between the key frame and the adjacent frame by adopting a feature descriptor extraction algorithm;
and deleting key frames whose similarity to the adjacent frames exceeds a third preset threshold.
5. The method for generating spatial data of a mobile terminal according to claim 1, wherein the mobile terminal comprises at least one of a smart phone, a smart watch, or a portable computer.
6. A spatially aware mobile terminal, comprising:
the inertial sensing module is used for acquiring initial inertial pose information corresponding to the time sequence;
the monocular photography module is used for collecting video streams corresponding to the time sequence;
the positioning module is used for acquiring global track information corresponding to the time sequence;
the spatial data generation module is used for pre-integrating the initial inertial pose to obtain optimized inertial pose information; is further used for determining the image features of adjacent frames in the video stream by adopting a sparse optical flow algorithm; is further used for generating initial visual pose information by adopting a motion structure recovery algorithm according to the image features of adjacent frames in the video stream; is further used for loosely coupling the initial visual pose information and the optimized inertial pose information to obtain calibration parameter values; is further used for determining key frames in the video stream according to a key frame strategy; is further used for configuring a marginalized sliding window according to the calibration parameter values and the key frames so as to reduce the computational complexity; is further used for calculating accumulated integration residuals between adjacent frame images based on the marginalized sliding window; is further used for performing tightly coupled nonlinear optimization on the initial visual pose information and the optimized inertial pose information based on the accumulated integration residuals to generate estimated pose information; is further used for configuring a local sliding window according to the calibration parameter values, the feature descriptor extraction algorithm and the local closed-loop constraint; is further used for carrying out local loop detection on the estimated pose information based on the local sliding window to generate local estimated pose information; is further used for establishing an association between global track information and local estimated pose information according to the time sequence, and generating a sparse point cloud with position attributes; is further used for inputting the video stream to the monocular depth estimation model to obtain image depth information; and is further used for updating the sparse point cloud into a dense point cloud according to the image depth information, and taking the dense point cloud as spatial data.
7. The spatially aware mobile terminal of claim 6, wherein the spatial data generation module is configured to determine key frames in the video stream according to a key frame policy, in particular to:
tracking image features of adjacent frames in the video stream by adopting a sparse optical flow algorithm;
when the number of image feature points of the adjacent frames in the video stream is smaller than a first preset threshold, taking the later of the adjacent frames in the time sequence as a first key frame;
calculating the average parallax between the image frames following the first key frame in the time sequence and the first key frame;
and when the average parallax is greater than a second preset threshold, determining the corresponding image frame in the time sequence as a second key frame.
8. The spatially aware mobile terminal of claim 6, wherein the marginalized sliding window is configured to:
recording visual pose information corresponding to a first number of key frames;
acquiring the image frame that follows the key frames in the time sequence;
when the subsequent image frame is a new key frame, recording the visual pose information corresponding to the new key frame, and marginalizing the earliest key frame in the time sequence together with the visual pose information corresponding to that key frame;
when the subsequent image frame is a non-key frame, recording the optimized inertial pose information corresponding to the non-key frame's position in the time sequence.
9. The spatially aware mobile terminal of claim 6, wherein the local sliding window is configured to:
recording key frames meeting local closed loop constraint;
calculating the similarity between the key frame and the adjacent frame by adopting a feature descriptor extraction algorithm;
and deleting key frames whose similarity to the adjacent frames exceeds a third preset threshold.
10. The spatially aware mobile terminal of claim 6, wherein the spatially aware mobile terminal comprises at least one of a smart phone, a smart watch, or a portable computer.
CN202310755542.9A 2023-06-26 2023-06-26 Mobile terminal space data generation method and space perception mobile terminal Pending CN116758161A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310755542.9A CN116758161A (en) 2023-06-26 2023-06-26 Mobile terminal space data generation method and space perception mobile terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310755542.9A CN116758161A (en) 2023-06-26 2023-06-26 Mobile terminal space data generation method and space perception mobile terminal

Publications (1)

Publication Number Publication Date
CN116758161A true CN116758161A (en) 2023-09-15

Family

ID=87960577

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310755542.9A Pending CN116758161A (en) 2023-06-26 2023-06-26 Mobile terminal space data generation method and space perception mobile terminal

Country Status (1)

Country Link
CN (1) CN116758161A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20180113060A (en) * 2017-04-05 2018-10-15 충북대학교 산학협력단 Keyframe extraction method for graph-slam and apparatus using thereof
CN111780754A (en) * 2020-06-23 2020-10-16 南京航空航天大学 Visual inertial odometer pose estimation method based on sparse direct method
CN114608561A (en) * 2022-03-22 2022-06-10 中国矿业大学 Positioning and mapping method and system based on multi-sensor fusion
CN114821280A (en) * 2022-04-28 2022-07-29 西安交通大学 Sliding window-based SLAM local real-time relocation method
CN115218906A (en) * 2022-07-19 2022-10-21 浙江农林大学 Indoor SLAM-oriented visual inertial fusion positioning method and system
US20230066441A1 (en) * 2020-01-20 2023-03-02 Shenzhen Pudu Technology Co., Ltd. Multi-sensor fusion slam system, multi-sensor fusion method, robot, and medium
CN116182837A (en) * 2023-03-16 2023-05-30 天津大学 Positioning and mapping method based on visual laser radar inertial tight coupling



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination