CN116563478A - Synchronous positioning and mapping SLAM algorithm, terminal and storage medium - Google Patents


Info

Publication number
CN116563478A
Authority
CN
China
Prior art keywords
target
network model
transformation matrix
image
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210102695.9A
Other languages
Chinese (zh)
Inventor
谢柠蔚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Communications Ltd Research Institute
Priority to CN202210102695.9A
Publication of CN116563478A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/05 Geographic models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Geometry (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Remote Sensing (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Graphics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the application discloses a synchronous positioning and mapping (SLAM) algorithm, a terminal and a storage medium. The terminal acquires a training data set and performs self-supervised training according to the training data set and an initial network model to obtain a target network model, where the initial network model includes a global attention module used to learn the importance degrees corresponding to different image blocks and to different channels. Feature extraction is then performed on target image information based on the target network model to obtain a target transformation matrix, and point cloud data is constructed based on the target transformation matrix.

Description

Synchronous positioning and mapping SLAM algorithm, terminal and storage medium
Technical Field
The invention relates to the technical field of computer vision, in particular to a synchronous positioning and mapping (simultaneous localization and mapping, SLAM) algorithm, a terminal and a storage medium.
Background
Existing SLAM algorithms can be broadly divided into feature point methods and direct methods. A feature point method first extracts features and corresponding descriptors from the images; because these features remain stable under small changes of the camera viewing angle, data association between images can be established through feature matching. When a two-dimensional to two-dimensional mapping relation is obtained, epipolar geometry can be solved; when a three-dimensional to two-dimensional mapping relation is obtained, the Perspective-n-Point (PnP), Efficient Perspective-n-Point (EPnP) or Perspective-Three-Point (P3P) algorithms and the like can be used for solving; and when a three-dimensional to three-dimensional mapping relation is obtained, the Iterative Closest Point (ICP) algorithm may be used. After the transformation matrix is solved, the camera pose and the three-dimensional coordinates of the spatial points can be fine-tuned and optimized by calculating the reprojection error to obtain the final values. The direct method instead completes data association from the optical flow between two frames under the grayscale-invariance assumption, and estimates the result by minimizing photometric error. Meanwhile, deep learning may also be applied to SLAM.
However, existing SLAM algorithms still have problems. In particular, they generally suffer from poor robustness, because not every region block of the input image contributes positively to the estimation result; for example, dynamic objects in the environment reduce the model accuracy.
Disclosure of Invention
Embodiments of the application provide a SLAM algorithm, a terminal and a storage medium with stronger robustness and higher accuracy.
The technical scheme of the embodiment of the application is realized as follows:
In a first aspect, an embodiment of the present application provides a SLAM algorithm, where the method includes:
acquiring a training data set;
performing self-supervised training according to the training data set and an initial network model to obtain a target network model; wherein the initial network model includes a global attention module, and the global attention module is used for learning the importance degrees corresponding to different image blocks and the importance degrees corresponding to different channels;
performing feature extraction processing on the target image information based on the target network model to obtain a target transformation matrix;
and constructing point cloud data based on the target transformation matrix.
In a second aspect, embodiments of the present application provide a terminal, where the terminal includes an acquisition unit, a training unit, an extraction unit and a construction unit,
the acquisition unit is used for acquiring a training data set;
the training unit is used for performing self-supervised training according to the training data set and an initial network model to obtain a target network model; wherein the initial network model includes a global attention module, and the global attention module is used for learning the importance degrees corresponding to different image blocks and the importance degrees corresponding to different channels;
the extraction unit is used for performing feature extraction processing on target image information based on the target network model to obtain a target transformation matrix;
the construction unit is used for constructing point cloud data based on the target transformation matrix.
In a third aspect, embodiments of the present application provide a terminal comprising a processor and a memory storing instructions executable by the processor which, when executed by the processor, implement the SLAM algorithm described above.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having a program stored thereon, for use in a terminal, the program, when executed by a processor, implementing a SLAM algorithm as described above.
Embodiments of the application provide a SLAM algorithm, a terminal and a storage medium. The terminal acquires a training data set; performs self-supervised training according to the training data set and an initial network model to obtain a target network model, where the initial network model includes a global attention module used for learning the importance degrees corresponding to different image blocks and to different channels; performs feature extraction processing on target image information based on the target network model to obtain a target transformation matrix; and constructs point cloud data based on the target transformation matrix. Because the global attention module is added to the structure of the initial network model and learns the importance of different image blocks and different channels, the trained target network model can automatically 'notice' the importance of different image regions during feature extraction, which greatly improves model accuracy; feature extraction is then performed on the target image information with the target network model, and point cloud data is constructed from the resulting target transformation matrix, so the method has stronger robustness and accuracy.
Drawings
FIG. 1 is a schematic flow chart of an implementation of the SLAM algorithm according to an embodiment of the present application;
FIG. 2 is a first schematic diagram of an implementation of the SLAM algorithm according to the embodiment of the present application;
FIG. 3 is a second schematic diagram of an implementation of the SLAM algorithm according to the embodiment of the present application;
FIG. 4 is a third schematic diagram of an implementation of the SLAM algorithm according to the embodiment of the present application;
FIG. 5 is a fourth schematic diagram of an implementation of the SLAM algorithm according to the embodiment of the present application;
FIG. 6 is a fifth schematic diagram of an implementation of the SLAM algorithm according to the embodiment of the present application;
FIG. 7 is a sixth schematic diagram of an implementation of the SLAM algorithm according to the embodiment of the present application;
FIG. 8 is a seventh schematic diagram of an implementation of the SLAM algorithm according to the embodiment of the present application;
FIG. 9 is an eighth schematic diagram of an implementation of the SLAM algorithm according to the embodiment of the present application;
FIG. 10 is a ninth schematic diagram of an implementation of the SLAM algorithm according to the embodiment of the present application;
FIG. 11 is a tenth schematic diagram of an implementation of the SLAM algorithm according to the embodiment of the present application;
FIG. 12 is an eleventh schematic diagram of an implementation of the SLAM algorithm according to the embodiment of the present application;
FIG. 13 is a first schematic diagram of the composition structure of a terminal according to an embodiment of the present application;
FIG. 14 is a second schematic diagram of the composition structure of a terminal according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to be limiting. It should be noted that, for convenience of description, only the portions related to the present application are shown in the drawings.
Real-time positioning and mapping of the surrounding environment while a machine carrying sensors is moving is an important problem in artificial intelligence; the main task is for the machine to perceive and understand an unfamiliar environment and localize itself within it. In the artificial intelligence era, this technology can replace manual work in certain special scenarios, such as autonomous driving, mobile robots, and Augmented Reality (AR)/Virtual Reality (VR) applications. For positioning, the primary task of the machine is to sense and then characterize the surrounding environment. Many solutions already exist for autonomous positioning and environment reconstruction when prior environmental information is known, such as the Global Positioning System (GPS), Inertial Navigation Systems (INS) and lidar systems; however, each has drawbacks: GPS can only be used in outdoor scenes, INS suffers from accumulated drift, and lidar systems are expensive. With the development of computer vision, achieving positioning and mapping without priors using only a visual sensor, namely a camera, has attracted wide attention from researchers, because a visual sensor is low in cost and high in precision, and the large amount of image data it acquires can also serve other environment sensing systems such as target detection.
Traditional SLAM algorithms have matured over many years into relatively stable methods, which can be broadly divided into feature point methods and direct methods. A feature point method first extracts features and corresponding descriptors from the images; because these features remain stable under small changes of the camera viewing angle, data association between images can be established through feature matching. When a two-dimensional to two-dimensional mapping relation is obtained, epipolar geometry can be solved; when a three-dimensional to two-dimensional mapping relation is obtained, the Perspective-n-Point (PnP), Efficient Perspective-n-Point (EPnP) or Perspective-Three-Point (P3P) algorithms and the like can be used for solving; and when a three-dimensional to three-dimensional mapping relation is obtained, the Iterative Closest Point (ICP) algorithm may be used. After the transformation matrix is solved, the camera pose and the three-dimensional coordinates of the spatial points can be fine-tuned and optimized by calculating the reprojection error to obtain the final values. The direct method instead completes data association from the optical flow between two frames under the grayscale-invariance assumption, and estimates the result by minimizing photometric error.
With the development of deep learning in recent years, deep learning has also been applied in the SLAM field. The basic idea of supervised deep learning methods is to design a loss function that can guide network training and to train the network on a labeled data set, thereby regressing the six-degree-of-freedom (6DOF) camera pose and per-pixel depth.
However, the above-mentioned traditional SLAM algorithms still have problems. For example, when the camera moves fast or rotates through too large an angle, the geometric thresholds set by the algorithm may be exceeded and tracking is lost. The algorithms also assume a static scene and have no good way of handling common dynamic objects in the environment, such as people and vehicles, which reduces model accuracy. In addition, traditional methods use hand-designed shallow image features, so in challenging scenes such as severe illumination changes, low brightness or weakly textured environments, accuracy degrades or the algorithm cannot run at all; the geometric computation is heavy and imposes a large computational burden on the system; and the hand-designed features are sparse and cannot make full use of the information in the whole image.
Existing supervised deep learning methods also have defects. When extracting features, the network generally takes the whole image as input, but in an actual image not every area block contributes positively to the estimation result; for example, regions containing dynamic objects reduce accuracy, and image blocks in different areas can have long-range coupling relations that an ordinary network cannot model spatially, so model accuracy drops, which is the aforementioned defect of poor robustness. In addition, supervised deep learning methods depend heavily on manually labeled data sets, which usually require a large amount of manpower and material resources, increasing labor cost.
In order to solve the above problems of existing synchronous positioning and mapping SLAM algorithms, embodiments of the present application provide a synchronous positioning and mapping SLAM algorithm, a terminal and a storage medium. The terminal acquires a training data set; performs self-supervised training according to the training data set and an initial network model to obtain a target network model, where the initial network model includes a global attention module used for learning the importance degrees corresponding to different image blocks and to different channels; performs feature extraction processing on target image information based on the target network model to obtain a target transformation matrix; and constructs point cloud data based on the target transformation matrix. The SLAM algorithm therefore has stronger robustness and accuracy.
It should be noted that the invention innovatively uses a global attention mechanism in the feature extraction stage of the SLAM network, so that the model automatically learns the dependencies among pixels, image blocks and channels, assigns importance weights to the spatial information of the extracted features, and makes reasonable use of the spatial information within and between images, thereby improving model accuracy. The key point of the SLAM algorithm provided by the embodiments of the application is that it can extract better features: image features directly describe a single picture, and the geometric information preserved between adjacent images provides the data and foundation for the whole SLAM computation, so the quality of the image features directly determines the quality of positioning and depth prediction. The global attention module adopted in the embodiments of the application therefore participates in feature extraction and greatly improves the results of the model, which is mainly reflected in three aspects. First, higher precision: when the global attention module is used to extract image features, channel correlation and long-range spatial dependencies are modeled, so the model can learn more image information and accuracy improves. Second, stronger robustness to challenging scenes: traditional SLAM algorithms rely on the static-world assumption and hand-designed shallow image features, so their accuracy drops severely or the system cannot run when facing illumination changes, dynamic objects and similar scenes, whereas extracting high-dimensional image features with a deep-learning-based neural network allows the system to keep running stably in harsh scenes. Third, lower cost: the model is trained without supervision based on affine transformation processing, so a large number of manually annotated data sets can be dispensed with, and continuous video sequences acquired with a visual sensor, such as a camera, are sufficient for network training.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
Example 1
An embodiment of the present application provides a SLAM method. Fig. 1 is a schematic flow chart of an implementation of the SLAM algorithm provided in the embodiment of the present application, and as shown in fig. 1, the SLAM algorithm may include the following steps:
step 101, acquiring a training data set.
In the embodiment of the application, the terminal may acquire the training data set first.
It should be noted that, in the embodiment of the present application, the training data set is used to perform training processing on the initial network model, and the training data set may be a data set including a video sequence.
Illustratively, in embodiments of the present application, the KITTI data set is used as the training data set. The KITTI data set is currently the largest algorithm evaluation data set for autonomous driving scenarios and contains real image data collected in urban, rural and highway scenes, with 22 video sequences in total, 11 of which have ground-truth labels. In the embodiment of the application, the first 9 sequences (00-08) are taken as the training data set. In addition, a test data set can be determined from the KITTI data set to test the trained target network model; for example, sequences 09-10 of the KITTI data set are selected as the test data set to evaluate network performance.
Further, in the embodiment of the present application, in order to improve the speed and accuracy of the training process, the training data set may be preprocessed in advance, including a resizing operation and the like; the resizing operation may resize each image to a preset size, for example 224×224, which makes the data better suited to training the initial network model and meets the real-time requirements of actual deployment.
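As an illustration of the preprocessing step described above, the following sketch resizes a frame to the preset size before it is fed to the network; it assumes PIL and torchvision are available, and the function name and transform pipeline are illustrative rather than taken from the application.

```python
# Illustrative preprocessing sketch (not from the patent): resize a KITTI frame
# to a fixed size before training, assuming torchvision is available.
from PIL import Image
import torchvision.transforms as T

preprocess = T.Compose([
    T.Resize((224, 224)),   # preset size mentioned above
    T.ToTensor(),           # HWC uint8 -> CHW float in [0, 1]
])

def load_frame(path: str):
    """Load one video frame and resize it to the preset training size."""
    return preprocess(Image.open(path).convert("RGB"))
```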
Step 102, performing self-supervised training according to the training data set and an initial network model to obtain a target network model; wherein the initial network model includes a global attention module, and the global attention module is used for learning the importance degrees corresponding to different image blocks and the importance degrees corresponding to different channels.
In the embodiment of the application, after acquiring the training data set, the terminal can perform self-supervised training according to the training data set and the initial network model to obtain the target network model; the initial network model includes a global attention module, which is used for learning the importance degrees corresponding to different image blocks and to different channels.
It should be noted that, in the embodiment of the present application, the initial network model includes a first sub-network and a second sub-network; the first sub-network may be a pose regression sub-network, and the second sub-network may be a depth prediction sub-network.
Further, in an embodiment of the present application, the feature extraction portion of the first sub-network may be a ResNet network.
For example, in the embodiment of the present application, fig. 2 is a first schematic diagram of an implementation of the SLAM algorithm proposed in the embodiment of the present application. As shown in fig. 2, ResNet has five exemplary network structures: ResNet18 (18-layer), ResNet34 (34-layer), ResNet50 (50-layer), ResNet101 (101-layer) and ResNet152 (152-layer). Taking the structure of ResNet18 as an example, the input section is composed of a 7×7 convolution kernel and a 3×3 maximum pooling layer, corresponding to an output size of 112×112; the 7×7 convolution has 64 output channels and a stride of 2, and the 3×3 max pooling layer has a stride of 2. The middle part is mainly used for feature extraction and comprises four convolution blocks: conv2_x, conv3_x, conv4_x and conv5_x; each convolution block is composed of stacked residual blocks, the numbers of stacked residual blocks being [2, 2, 2, 2], that is, conv2_x is composed of a stack of 2 residual blocks, and so on. The output part is composed of an average pooling layer, a 1000-dimensional fully connected (FC) layer and a softmax function. Different configurations correspond to different amounts of computation (floating point operations, FLOPs); for example, the FLOPs of ResNet18 is 1.8×10^9.
Further, in an embodiment of the present application, the ResNet network includes an input portion, an output portion, and an intermediate portion; wherein the middle part is made up of stacked residual blocks.
In an embodiment of the present application, fig. 3 is a second schematic diagram of an implementation of the SLAM algorithm provided in the embodiment of the present application. As shown in fig. 3, which illustrates a residual block (basic block), the 64-dimensional input data passes through two paths: one path goes through two 3×3 convolutions and the other is a shortcut connection; the two paths are added and the result is output through a rectified linear unit (ReLU) activation function.
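The residual block of fig. 3 can be sketched as follows in PyTorch; this is a minimal illustration of the two-path structure described above, and the BatchNorm layers follow the standard ResNet basic block rather than anything stated in the application.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Residual basic block: two 3x3 convolutions on one path, an identity
    shortcut on the other, summed and passed through ReLU."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # shortcut addition, then ReLU
```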
For example, in the embodiment of the present application, the first sub-network may be constructed based on a ResNet18 network. Fig. 4 is a third schematic diagram of an implementation of the SLAM algorithm set forth in the embodiment of the present application and shows the structure of the first sub-network using the ResNet18 network, which mainly includes an input stem, a middle part and an output part. The middle part comprises four convolution blocks, namely Stage 1, Stage 2, Stage 3 and Stage 4; the input stem consists of a 7×7 convolution kernel and a 3×3 max pooling layer, where the 7×7 convolution has 64 output channels and a stride of 2, and the 3×3 max pooling layer has a stride of 2; Stage 4 may include downsampling and residual blocks. The output section performs global adaptive average pooling, for example converting a 1×512×7×7 input into 1×512×1×1, yielding the features that are finally used to regress the 6DOF pose.
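A minimal sketch of such a pose regression sub-network is given below, assuming PyTorch and torchvision; the 6-channel input (two stacked RGB frames) and the layer names are illustrative assumptions, not the application's exact implementation.

```python
import torch
import torch.nn as nn
import torchvision

class PoseNet(nn.Module):
    """Sketch of a pose regression sub-network in the spirit of the description:
    a ResNet18 feature extractor, global adaptive pooling of the 512-channel
    feature map, and a fully connected layer regressing a 6DOF pose."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        # Accept two concatenated RGB frames (6 channels) instead of 3.
        backbone.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.pool = nn.AdaptiveAvgPool2d(1)   # e.g. 1x512x7x7 -> 1x512x1x1
        self.fc = nn.Linear(512, 6)           # 3 translation + 3 rotation parameters

    def forward(self, pair: torch.Tensor) -> torch.Tensor:
        f = self.pool(self.features(pair)).flatten(1)
        return self.fc(f)
```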
Further, in an embodiment of the present application, the depth prediction sub-network is a U-Net network; the U-Net network includes an encoder and a decoder, where the encoder may be a ResNet network and the decoder may be a DispNet network.
Illustratively, in embodiments of the present application, the depth prediction sub-network is a U-Net network; the encoder of the depth prediction sub-network may adopt ResNet50, and the decoder adopts a DispNet network. ResNet50 has the same basic structure as ResNet18 and differs only in the number of stacked convolution blocks in the middle part. Fig. 5 is a fourth schematic diagram of an implementation of the SLAM algorithm according to the embodiment of the present application and shows the U-Net network structure. The network predicts depth directly on the small feature map while deconvolving (upconv) in the forward direction; the prediction is bilinearly interpolated and spliced onto the up-convolved feature map, and the subsequent deconvolution continues. After this is repeated four times, the predicted depth has 1/4 of the input resolution, and bilinear interpolation is then used to obtain the depth map D_t at the same resolution as the input previous frame image I_t of the adjacent frame images.
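The decoder step described above (upsample, splice with the encoder feature map, convolve, repeat) can be sketched as follows; channel sizes and function names are illustrative assumptions, and the final bilinear interpolation back to the input resolution is shown separately.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpConvBlock(nn.Module):
    """One decoder step of a U-Net style depth network: upsample the coarse
    feature map, concatenate the corresponding encoder (skip) feature map,
    and fuse with a convolution."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.fuse = nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1)

    def forward(self, x, skip):
        x = F.interpolate(self.reduce(x), scale_factor=2, mode="bilinear",
                          align_corners=False)          # upsample ("upconv")
        x = torch.cat([x, skip], dim=1)                  # splice skip features
        return F.relu(self.fuse(x))

def to_full_resolution(depth_quarter, target_hw):
    """Bilinearly interpolate a 1/4-resolution depth map back to the
    resolution of the input frame I_t, as described above."""
    return F.interpolate(depth_quarter, size=target_hw, mode="bilinear",
                         align_corners=False)
```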
Further, in the embodiment of the present application, during training, adjacent frame images in the training data set are input into the initial network model; the two adjacent frame images can be spliced along the channel dimension when input, for example forming image data of size 256×256×6.
It is understood that in the embodiments of the present application, the target network model refers to a network model obtained after the initial network model is trained.
It should be noted that, in the embodiment of the present application, a global attention module is added to the initial network model. Ordinary networks in the prior art treat every pixel of the original image and of the feature maps extracted by the encoder, and every channel, equally, which has serious drawbacks: in an actual image, not every area block contributes positively to the SLAM result, regions containing dynamic objects reduce accuracy, and image blocks in different areas can have long-range coupling relations. By adding the global attention module, the model can automatically learn the long-range coupling relations between image blocks and automatically assign importance weights to the information of different image blocks, so it can automatically notice the importance of different areas in the image, which improves model accuracy; the global attention module can also realize channel weighting to establish the relevance of spatial information. Adding the global attention module to the feature extraction parts of the pose regression sub-network and the depth prediction sub-network therefore allows features to be extracted better and improves the precision and robustness of the model.
Further, in an embodiment of the present application, the global attention module is a global context attention GC module; the global attention module includes a first attention module and a second attention module.
Further, in an embodiment of the present application, the first subnetwork comprises a first attention module; the second subnetwork includes a second attention module; that is, the present application adds the first attention module to the first sub-network and adds the second attention module to the second sub-network, so that the first sub-network and the second sub-network can perform further spatial information fusion weighting on the features when performing feature extraction processing.
Illustratively, in the embodiment of the present application, the first sub-network is a pose regression sub-network, the second sub-network is a depth prediction sub-network, and the global attention module is a GC module; the first attention module is added to the feature extraction part of the pose regression sub-network, namely after the residual blocks of the feature extraction part; the second attention module is added to the encoder of the depth prediction sub-network, for example after the residual blocks of the encoder.
Note that common attention mechanisms include the channel attention mechanism (SE) and the spatial attention mechanism (SNL). Fig. 6 is a fifth schematic diagram of an implementation of the SLAM algorithm according to the embodiment of the present application and shows the structure of the channel attention mechanism, in which broadcast element-wise multiplication is used; the channel attention mechanism lets the model learn the importance of different channels in order to weight them. Fig. 7 is a sixth schematic diagram of an implementation of the SLAM algorithm according to the embodiment of the present application and shows the structure of the spatial attention mechanism, in which broadcast element-wise addition is used; in the spatial attention mechanism, the transform section has a large number of parameters because it employs a 1×1 convolutional layer $W_v$ containing C×C parameters. Spatial attention can establish the relevance of global spatial information by modeling long-distance dependencies and weighting and fusing long-distance information.
Further, the embodiment of the present application adopts a global context attention (Global Context, GC) module. Fig. 8 is a seventh schematic diagram of an implementation of the SLAM algorithm proposed in the embodiment of the present application and shows the structure of the global context attention module, in which matrix multiplication is used. In the global context attention module, attention weights are first obtained by a 1×1 convolution followed by a softmax function, and are then multiplied with the original input features to obtain the global context features. Unlike the spatial attention mechanism, the global context attention module employs a bottleneck transform module instead of the convolutional layer $W_v$; the bottleneck transform module can establish channel dependencies and reduces the number of parameters from C×C to 2×C×C/r, where r is the bottleneck ratio (set to r=16 by default), so that the parameter count of the transform module is reduced to 1/8 of that of the spatial attention mechanism. Since the bottleneck transform increases the difficulty of optimization, a normalization layer is added before the bottleneck transform to ease optimization. The global context attention module is able to model the global spatial context, can realize channel weighting and establish the relevance of spatial information, and has a small parameter count and low computational complexity. That is, the global context attention module consists of global attention pooling for context modeling, a bottleneck transform for modeling channel dependencies, and feature fusion with the original input by broadcast element-wise addition; and because the global context attention module is lightweight, it can be applied to multiple layers of the network to better capture long-range dependencies. The global context attention module can be expressed as the following formula:

$$z_i = x_i + W_{v2}\,\mathrm{ReLU}\!\left(\mathrm{LN}\!\left(W_{v1}\sum_{j=1}^{N_p}\frac{e^{W_k x_j}}{\sum_{m=1}^{N_p} e^{W_k x_m}}\,x_j\right)\right)$$

where $\alpha_j = \dfrac{e^{W_k x_j}}{\sum_{m=1}^{N_p} e^{W_k x_m}}$ is the global attention weight, $N_p$ is the number of spatial positions, and $\delta(\cdot) = W_{v2}\,\mathrm{ReLU}\{\mathrm{LN}\{W_{v1}(\cdot)\}\}$ denotes the bottleneck transform.
It should be noted that, in the embodiment of the present application, the role of the GC module mainly includes three aspects: first, obtaining the global context features through context modeling; second, establishing channel dependencies with the bottleneck transform module; and third, fusing the result with the original input features by broadcast element-wise addition. Moreover, since the GC module is lightweight, it can be applied to multiple layers of the network to better capture long-range dependencies.
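A compact PyTorch sketch of a global context block with the three components just listed (context modeling, bottleneck transform, broadcast fusion) is shown below; it follows the published GC design and the formula above, and is not claimed to be the application's exact implementation.

```python
import torch
import torch.nn as nn

class GlobalContextBlock(nn.Module):
    """Global context (GC) attention block: 1x1 conv + softmax for global
    attention pooling, a bottleneck transform (1x1 conv -> LayerNorm -> ReLU
    -> 1x1 conv) for channel dependencies, and broadcast addition onto the
    input."""
    def __init__(self, channels: int, ratio: int = 16):
        super().__init__()
        hidden = max(channels // ratio, 1)
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)        # W_k
        self.transform = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),           # W_v1
            nn.LayerNorm([hidden, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),           # W_v2
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Context modelling: attention weights over all H*W positions.
        weights = self.attn(x).view(b, 1, h * w).softmax(dim=-1)           # (B,1,HW)
        context = torch.bmm(x.view(b, c, h * w), weights.transpose(1, 2))  # (B,C,1)
        context = context.view(b, c, 1, 1)
        # Bottleneck transform + broadcast element-wise addition.
        return x + self.transform(context)
```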
In some embodiments of the present application, the method by which the terminal performs self-supervised training according to the training data set and the initial network model to obtain the target network model may include: obtaining a depth map and a transformation matrix based on adjacent frame images in the training data set and the initial network model, where the depth map represents the depth information of the previous frame image in the adjacent frame images and the transformation matrix represents the relative transformation relation between the adjacent frame images; obtaining a predicted image corresponding to the subsequent frame image in the adjacent frame images according to the depth map and the transformation matrix; calculating a loss function value based on the predicted image; and performing self-supervised training on the initial network model according to the loss function value to obtain the target network model.
Step 103, performing feature extraction processing on the target image information based on the target network model to obtain a target transformation matrix.
In the embodiment of the application, after performing self-supervision training processing according to the training data set and the initial network model to obtain the target network model, the terminal may perform feature extraction processing on the target image information based on the target network model to obtain the target transformation matrix.
In the embodiment of the present application, the target image information refers to the image information input when the point cloud data is constructed by using the target network model; the target image information may be a large amount of image data acquired from the vision sensor.
It can be understood that in the embodiment of the present application, when the feature extraction processing is performed on the target image information based on the target network model, the feature extraction processing is still performed on the basis of two adjacent frames of images in the target image information, so as to obtain the corresponding target transformation matrix.
It should be noted that, in the embodiment of the present application, the target transformation matrix is the estimated target of SLAM; the target transformation matrix may characterize a relative transformation relationship between adjacent frames of image information in the target image information.
Step 104, constructing point cloud data based on the target transformation matrix.
In the embodiment of the application, after performing feature extraction processing on target image information based on a target network model to obtain a target transformation matrix, the terminal can construct point cloud data based on the target transformation matrix.
In the embodiment of the present application, the point cloud data is a set of sampling points with spatial coordinates, which is huge and dense, and can be widely applied to the fields of mapping, electric power, construction, industry and the like.
Further, in an embodiment of the present application, the method for performing feature extraction processing on the target image information by the terminal based on the target network model may include the following steps:
step 201, performing global attention calculation on the target image information based on a target global attention module in the target network model so as to complete feature extraction processing.
In the embodiment of the application, the terminal performs feature extraction processing on the target image information based on the target network model, and in some embodiments of the application, the terminal may perform global attention calculation on the target image information based on the target global attention module in the target network model to complete the feature extraction processing.
It should be noted that, in the embodiment of the present application, the target global attention module refers to a global attention module included in a target network model obtained after training.
It may be understood that, in the embodiment of the present application, since the target network model includes the target global attention module, the process of performing feature extraction processing on the target image information by using the target network model further includes a calculation process participated by the target global attention module, which is global attention calculation.
Further, in the embodiment of the application, the target global attention module is utilized to perform global attention calculation, so that enhancement features corresponding to original input features input into the global attention module can be obtained.
Further, in the embodiment of the present application, the method for performing global attention calculation on the target image information by the terminal based on the target global attention module in the target network model, that is, the method proposed by step 201 may include the following steps:
step 201a, attention weight is acquired through a first module in the target global attention module, and global context characteristics are obtained by multiplying the attention weight with the original input characteristics.
In the embodiment of the application, the terminal performs global attention calculation on the target image information based on the target global attention module in the target network model, and in some embodiments of the application, the terminal may acquire the attention weight through the first module in the target global attention module, and multiply the attention weight with the original input feature to obtain the global context feature.
It should be noted that, in the embodiment of the present application, the structure of the first module is as shown in fig. 8, and may include a 1×1 convolution layer and a softmax function.
Further, in the embodiment of the present application, the attention weight may represent the importance degrees corresponding to different areas in the target image information.
Further, in the embodiment of the present application, as shown in the foregoing fig. 8, after the attention weight is obtained by the first module including the 1×1 convolution layer and the softmax function, the attention weight is multiplied by the original input feature corresponding to the target image information, thereby obtaining the global context feature.
Step 201b, obtaining a channel dependency relationship corresponding to the global context feature according to a second module in the global attention module; the second module comprises a bottleneck conversion module, a normalization layer and an activation layer.
In the embodiment of the application, after the terminal obtains the attention weight through the first module in the target global attention module and multiplies the attention weight by the original input feature to obtain the global context feature, the terminal may obtain the channel dependency relationship corresponding to the global context feature according to the second module in the global attention module; the second module comprises a bottleneck conversion module, a normalization layer and an activation layer.
It should be noted that, in the embodiment of the present application, the second module includes a bottleneck conversion module, a normalization layer and an activation layer; illustratively, as shown in fig. 8 above, the second module includes a bottleneck conversion module consisting of two 1×1 convolution layers, a normalization layer (LayerNorm) and an activation layer (ReLU).
It can be seen that in the embodiment of the present application, the global context feature is input to the second module, and the channel dependency relationship can be output by the second module.
Step 201c, performing feature fusion processing according to the channel dependency relationship and the target image information to complete global attention calculation.
In the embodiment of the application, after the terminal obtains the channel dependency relationship corresponding to the global context feature according to the second module, feature fusion processing can be performed on the terminal according to the channel dependency relationship and the target image information so as to complete global attention calculation.
In the embodiment of the present application, after the channel dependency relationship is obtained, the feature fusion process is performed in a broadcast element addition manner according to the channel dependency relationship and the original input, that is, the target image information, as shown in fig. 8, so as to complete the global attention calculation.
The embodiment of the application provides a SLAM algorithm in which the terminal acquires a training data set; performs self-supervised training according to the training data set and an initial network model to obtain a target network model, where the initial network model includes a global attention module used for learning the importance degrees corresponding to different image blocks and to different channels; performs feature extraction processing on target image information based on the target network model to obtain a target transformation matrix; and constructs point cloud data based on the target transformation matrix. Because the global attention module is added to the structure of the initial network model and learns the importance of different image blocks and different channels, the trained target network model can automatically 'notice' the importance of different image regions during feature extraction, which greatly improves model accuracy; feature extraction is then performed on the target image information with the target network model, and point cloud data is constructed from the resulting target transformation matrix, so the method has stronger robustness and accuracy.
Example 2
Based on the above embodiment, in another embodiment of the present application, the method by which the terminal performs self-supervised training according to the training data set and the initial network model to obtain the target network model, that is, the method proposed in step 102, may include the following steps:
Step 102a, obtaining a depth map and a transformation matrix based on adjacent frame images in the training data set and the initial network model; wherein the depth map characterizes the depth information of the previous frame image in the adjacent frame images, and the transformation matrix characterizes the relative transformation relationship between the adjacent frame images.
In the embodiment of the application, the terminal performs self-supervised training according to the training data set and the initial network model to obtain the target network model, and in some embodiments of the application, the terminal can obtain the depth map and the transformation matrix based on the adjacent frame images in the training data set and the initial network model; the depth map represents the depth information of the previous frame image in the adjacent frame images, and the transformation matrix characterizes the relative transformation relationship between the adjacent frame images.
It can be understood that, in the embodiment of the present application, the adjacent frame image includes a previous frame image and an image of a subsequent frame corresponding to the previous frame image; by inputting the adjacent frame images in the training dataset into the initial network model, a corresponding depth map and transformation matrix can be obtained.
In the embodiment of the present application, the depth map characterizes depth information of a previous frame image among the adjacent frame images.
It should also be noted that, in embodiments of the present application, the transformation matrix characterizes the relative transformation relationship between adjacent frame images in the training dataset.
Step 102b, obtaining a predicted image corresponding to the subsequent frame image in the adjacent frame images according to the depth map and the transformation matrix, calculating a loss function value based on the predicted image, and performing self-supervised training on the initial network model according to the loss function value to obtain the target network model.
In the embodiment of the application, after obtaining the depth map and the transformation matrix based on the adjacent frame images in the training data set and the initial network model, the terminal may obtain a predicted image corresponding to the subsequent frame image in the adjacent frame images according to the depth map and the transformation matrix, calculate a loss function value based on the predicted image, and perform self-supervised training on the initial network model according to the loss function value to obtain the target network model.
In some embodiments of the present application, the method by which the terminal obtains the predicted image corresponding to the subsequent frame image in the adjacent frame images according to the depth map and the transformation matrix may include: performing affine transformation processing according to the camera intrinsic information, the depth map and the transformation matrix to obtain the predicted image; this affine transformation processing is also called warping.
That is, in the embodiment of the present application, the terminal obtains the predicted image corresponding to the image of the subsequent frame mainly through affine transformation processing, and further calculates the loss function value by using the predicted image, so as to perform self-supervision training processing on the initial network model according to the loss function value; the method can realize joint optimization without manual labeling, reduce the dependence of the model on supervised data and effectively reduce training cost.
Further, in embodiments of the present application, the predicted image characterizes the prediction of the subsequent frame image in the adjacent frame images; it may also be understood as a synthesized, rather than real, frame.
Further, in the embodiment of the present application, the loss function value may be used to perform self-supervision training on the initial network model; the loss function values include a first loss function value and a second loss function value.
In some embodiments of the present application, a method of a terminal calculating a loss function value based on a predicted image may include: calculating a first loss function value based on the predicted image and a subsequent frame image of the adjacent frame images; wherein the first loss function value characterizes the degree of difference between a subsequent frame image and the predicted image in the adjacent frame images; calculating a second loss function value based on the depth map; wherein the second loss function value characterizes a degree of variation of the depth map; a loss function value is determined from the first loss function value and the second loss function value.
It can be understood that in the embodiment of the present application, the terminal performs self-supervised training on the initial network model according to the loss function value, that is, the loss function value guides the subsequent training direction; for example, the loss may be back-propagated along the direction of steepest descent according to the derivative of the loss function so as to correct each weight used in the forward computation, and the training process may be stopped once the loss function value meets a preset threshold.
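The training procedure described here can be sketched as a conventional gradient-descent loop; the data layout, loss interface and stopping rule below are assumptions for illustration.

```python
import torch

def train(model, loader, optimizer, loss_fn, threshold: float, max_epochs: int = 50):
    """Illustrative self-supervised training loop: compute the loss from the
    predicted image, back-propagate, update the weights, and stop once the
    loss falls below a preset threshold."""
    for epoch in range(max_epochs):
        for prev_frame, next_frame, intrinsics in loader:
            loss = loss_fn(model, prev_frame, next_frame, intrinsics)
            optimizer.zero_grad()
            loss.backward()       # gradients of the loss w.r.t. all weights
            optimizer.step()      # correct the weights along the descent direction
        if loss.item() < threshold:
            break
```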
Further, in the embodiment of the present application, the method for obtaining, by the terminal, the predicted image corresponding to the next frame image in the adjacent frame images according to the depth map and the transformation matrix, that is, the method proposed in step 102b may include the following steps:
and step 102b1, carrying out affine transformation processing according to the camera internal parameter information, the depth map and the transformation matrix to obtain the predicted image.
In the embodiment of the application, the terminal obtains the predicted image corresponding to the next frame image in the adjacent frame images according to the depth map and the transformation matrix, and in some embodiments of the application, the terminal may perform affine transformation processing according to the camera intrinsic information, the depth map and the transformation matrix to obtain the predicted image.
Illustratively, in the embodiments of the present application, the principle of the affine transformation (warping) computation can be expressed as $p_{t+1} \sim K\, T_{t \to s}\, D_t(p_t)\, K^{-1} p_t$, where t and s denote the subsequent frame and the previous frame of the input adjacent frame images respectively, $p$ denotes a pixel position, and $K$, $T$ and $D$ denote the camera intrinsic matrix, the transformation matrix and the depth map respectively. Through this affine transformation processing, the predicted image corresponding to the subsequent frame image can be obtained.
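A sketch of this warping computation is given below, assuming PyTorch tensors for the image, depth map, transformation matrix and intrinsics; shapes and names are illustrative, and details such as clamping are implementation choices rather than part of the application.

```python
import torch
import torch.nn.functional as F

def warp_frame(src_img, depth_t, T_t_to_s, K):
    """Lift every pixel p_t of frame t to 3D with D_t and K^{-1}, transform it
    with T_{t->s}, project it back with K, and bilinearly sample the source
    frame. Shapes: src_img (B,3,H,W), depth_t (B,1,H,W), T (B,3,4), K (B,3,3)."""
    b, _, h, w = depth_t.shape
    device = depth_t.device
    ys, xs = torch.meshgrid(torch.arange(h, device=device),
                            torch.arange(w, device=device), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).float().view(1, 3, -1)       # homogeneous p_t
    cam = torch.inverse(K) @ pix * depth_t.view(b, 1, -1)                 # D_t * K^{-1} p_t
    cam_h = torch.cat([cam, torch.ones(b, 1, h * w, device=device)], 1)   # homogeneous 3D points
    proj = K @ (T_t_to_s @ cam_h)                                         # K T_{t->s} X
    px = proj[:, 0] / proj[:, 2].clamp(min=1e-6)
    py = proj[:, 1] / proj[:, 2].clamp(min=1e-6)
    # Normalise coordinates to [-1, 1] and sample the source image.
    grid = torch.stack([2 * px / (w - 1) - 1, 2 * py / (h - 1) - 1], dim=-1)
    grid = grid.view(b, h, w, 2)
    return F.grid_sample(src_img, grid, align_corners=False)
```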
Further, in an embodiment of the present application, the method for calculating the loss function value based on the predicted image by the terminal may include the steps of:
step 301, calculating a first loss function value based on a predicted image and a later frame image in the adjacent frame images; wherein the first loss function value characterizes a degree of difference between a subsequent frame image and the predicted image in the neighboring frame images.
In embodiments of the present application, the terminal calculates a loss function value based on the predicted image, and in some embodiments of the present application, the terminal may calculate a first loss function value based on the predicted image and a subsequent frame image of the adjacent frame images; wherein the first loss function value characterizes a degree of difference between a subsequent frame image and the predicted image in the neighboring frame images.
It should be noted that, in the embodiment of the present application, the first loss function value is calculated based on the predicted image and the subsequent image of the adjacent frame images, in other words, the first loss function is calculated based on the predicted subsequent image of the adjacent frame images and the actual subsequent image of the adjacent frame images; thus, the first loss function value may characterize a degree of difference between a subsequent one of the adjacent frame images and the predicted image.
Further, in the embodiment of the present application, the first loss function value may be calculated in two ways. One calculation manner compares the predicted image with the real subsequent frame image pixel by pixel and averages the difference over the image, where N is the number of pixel points, I(t+1) represents the real subsequent frame image, and Î(t+1) represents the predicted image corresponding to the subsequent frame image.
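A hedged sketch of such a per-pixel photometric comparison (here the mean absolute difference, which is one common choice but not necessarily the exact formula of the application) is:

```python
import torch

def photometric_l1(pred_next, real_next):
    """Per-pixel photometric error between the real next frame I(t+1) and the
    predicted (warped) frame, averaged over the N pixels."""
    return (real_next - pred_next).abs().mean()
```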
Further, in the embodiment of the present application, to better measure the difference between the real next frame image and the predicted next frame image, another calculation manner may be adopted in the embodiment of the present application: the first loss function value is calculated using a structural similarity index (Structural Similarity, SSIM).
Step 302, calculating a second loss function value based on the depth map; wherein the second loss function value characterizes a degree of change of the depth map.
In embodiments of the present application, the loss function value is calculated based on the predicted image, and in some embodiments of the present application, the terminal may calculate a second loss function value based on the depth map; wherein the second loss function value characterizes a degree of change of the depth map.
It should be noted that, in the embodiment of the present application, in order to make the obtained depth map have a locally smooth property, the smoothness error may be used to perform constraint optimization on the depth map, that is, calculate the second loss function based on the depth map.
For example, in the embodiment of the present application, the method for determining the second loss function value may be expressed as the following formula:

$L_{smooth} = \frac{1}{N}\sum_{x,y}\left( \left|\partial_x D(x,y)\right| e^{-\left|\partial_x I(x,y)\right|} + \left|\partial_y D(x,y)\right| e^{-\left|\partial_y I(x,y)\right|} \right)$

where N is the number of pixel points, D(x, y) represents the depth value of the depth map at the pixel with coordinates (x, y), and I(x, y) represents the corresponding pixel of the input image.
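As an illustrative sketch of such a smoothness constraint, the following assumes PyTorch tensors, a depth map of shape [B, 1, H, W] and an image of shape [B, 3, H, W]; the edge-aware weighting shown here is a common choice and is not asserted to be the exact formulation of the present application.

```python
import torch

def smoothness_loss(depth, image):
    # First-order depth gradients along x and y
    dd_x = torch.abs(depth[:, :, :, :-1] - depth[:, :, :, 1:])
    dd_y = torch.abs(depth[:, :, :-1, :] - depth[:, :, 1:, :])

    # Image gradients, averaged over channels, used to down-weight depth changes at edges
    di_x = torch.mean(torch.abs(image[:, :, :, :-1] - image[:, :, :, 1:]), dim=1, keepdim=True)
    di_y = torch.mean(torch.abs(image[:, :, :-1, :] - image[:, :, 1:, :]), dim=1, keepdim=True)

    # Penalize depth variation where the image itself is smooth
    return (dd_x * torch.exp(-di_x)).mean() + (dd_y * torch.exp(-di_y)).mean()
```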
Further, in an embodiment of the present application, the second loss function value characterizes a degree of variation of the depth map.
Step 303, determining a loss function value according to the first loss function value and the second loss function value.
In an embodiment of the present application, the terminal calculates the first loss function value based on the prediction image and a subsequent frame image of the neighboring frame images, and after calculating the second loss function value based on the depth map, may determine the loss function value from the first loss function value and the second loss function value.
In the embodiment of the present application, the first loss function value and the second loss function value may each be assigned a corresponding weight parameter; each loss function value is multiplied by its weight parameter, and the two products are added to obtain the loss function value.
Illustratively, in an embodiment of the present application, the loss function value determined from the first loss function value and the second loss function value is:
$LOSS = \alpha L_{photo} + \beta L_{smooth}$    (4)

where α and β are the weight parameters corresponding to the first loss function value and the second loss function value, respectively.
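A minimal sketch of formula (4) follows; the default weight values are placeholders and are not taken from the present application.

```python
import torch

def total_loss(l_photo: torch.Tensor, l_smooth: torch.Tensor,
               alpha: float = 1.0, beta: float = 0.1) -> torch.Tensor:
    # LOSS = alpha * L_photo + beta * L_smooth, as in formula (4);
    # alpha and beta are placeholder weights for illustration only.
    return alpha * l_photo + beta * l_smooth
```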
Further, in an embodiment of the present application, the method for calculating the first loss function value by the terminal based on the predicted image and the next frame image in the adjacent frame images, that is, the method proposed in step 301 may include the steps of:
step 301a, calculating a first loss function value according to the predicted image, the next frame image in the adjacent frame images, and the structural similarity index SSIM.
In embodiments of the present application, the terminal calculates the first loss function value based on the predicted image and a subsequent one of the neighboring frame images, and in some embodiments of the present application, the terminal may calculate the first loss function value from the predicted image, the subsequent one of the neighboring frame images, and the SSIM.
For example, the first loss function value calculated using SSIM may be expressed as the following formula:

$L_{photo} = \frac{1}{2}\left(1 - \mathrm{SSIM}\left(I(t+1), \hat{I}(t+1)\right)\right)$

wherein $\mathrm{SSIM}(\cdot,\cdot)$ represents the portion calculated using the structural similarity index; $\hat{I}(t+1)$ represents the predicted image; and I(t+1) represents the later frame image in the adjacent frame images.
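For illustration, a simplified SSIM-based photometric term might be computed as follows; the 3x3 average-pooling windows and the (1 - SSIM)/2 form are common choices in self-supervised depth estimation and are assumptions here, not details confirmed by the present application.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified per-pixel SSIM over 3x3 windows (illustrative sketch)."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return num / den

def photometric_loss_ssim(pred, target):
    # Higher structural similarity gives a smaller loss value
    return torch.clamp((1.0 - ssim(pred, target)) / 2.0, 0.0, 1.0).mean()
```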
Further, in an embodiment of the present application, the method for obtaining the depth map and the transformation matrix by the terminal based on the adjacent frame images in the training data set and the initial network model, that is, the method proposed by step 102a may include the following steps:
step 102a1, performing feature extraction processing on the adjacent frame images based on the first sub-network to obtain a transformation matrix.
In the embodiment of the application, the terminal obtains the depth map and the transformation matrix based on the adjacent frame images in the training data set and the initial network model, and in some embodiments of the application, the terminal may perform feature extraction processing on the adjacent frame images based on the first sub-network to obtain the transformation matrix.
Illustratively, in the embodiment of the present application, the first sub-network is a pose regression sub-network constructed based on ResNet18, and the feature extraction processing is performed on the two adjacent frame images by using the pose regression sub-network to obtain the transformation matrix.
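A minimal sketch of such a pose regression sub-network is shown below; it assumes a recent PyTorch/torchvision, a 6-channel input formed by stacking the two frames, and a 6-DoF output, none of which are asserted to match the application's exact architecture.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class PoseNet(nn.Module):
    """Pose regression sub-network sketch with a ResNet18 backbone.

    The two adjacent frames are concatenated along the channel axis and
    regressed to a 6-DoF pose (3 translation + 3 rotation parameters),
    which can then be converted into a 4x4 transformation matrix.
    """

    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)  # no pretrained weights
        # Accept 6 input channels (two stacked RGB frames)
        backbone.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier
        self.fc = nn.Linear(512, 6)

    def forward(self, frame_t, frame_t1):
        x = torch.cat([frame_t, frame_t1], dim=1)
        x = self.features(x).flatten(1)
        return self.fc(x)
```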
Step 102a2, performing feature extraction processing on the previous frame image based on the second sub-network to obtain a depth map.
In the embodiment of the application, the terminal obtains the depth map and the transformation matrix based on the adjacent frame images in the training data set and the initial network model, and in some embodiments of the application, the terminal may perform feature extraction processing on the previous frame image based on the second sub-network to obtain the depth map.
In an embodiment of the present application, the second sub-network is a depth prediction sub-network of a U-Net network structure, and the depth prediction sub-network is used to perform feature extraction processing on a previous frame image in the adjacent frame images to obtain a depth map.
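Similarly, a simplified sketch of a U-Net-style depth prediction sub-network with a ResNet18 encoder follows; the decoder here is a generic upsampling decoder rather than a full DispNet, and the channel counts, output range and output resolution are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class DepthNet(nn.Module):
    """U-Net-style depth prediction sketch: ResNet18 encoder with skip connections."""

    def __init__(self):
        super().__init__()
        enc = models.resnet18(weights=None)
        self.stem = nn.Sequential(enc.conv1, enc.bn1, enc.relu)   # 64 ch, 1/2 resolution
        self.pool = enc.maxpool
        self.layer1, self.layer2 = enc.layer1, enc.layer2         # 64, 128 ch
        self.layer3, self.layer4 = enc.layer3, enc.layer4         # 256, 512 ch

        def up(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ELU(),
                                 nn.Upsample(scale_factor=2, mode='nearest'))

        self.dec4, self.dec3 = up(512, 256), up(256 + 256, 128)
        self.dec2, self.dec1 = up(128 + 128, 64), up(64 + 64, 32)
        self.head = nn.Conv2d(32, 1, 3, padding=1)

    def forward(self, x):
        s1 = self.stem(x)                 # 1/2
        e1 = self.layer1(self.pool(s1))   # 1/4
        e2 = self.layer2(e1)              # 1/8
        e3 = self.layer3(e2)              # 1/16
        e4 = self.layer4(e3)              # 1/32
        d = self.dec4(e4)
        d = self.dec3(torch.cat([d, e3], dim=1))
        d = self.dec2(torch.cat([d, e2], dim=1))
        d = self.dec1(torch.cat([d, e1], dim=1))
        # Depth (disparity-like) map at half the input resolution in this sketch
        return torch.sigmoid(self.head(d))
```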
In summary, in the embodiment of the present application, fig. 9 is an exemplary schematic diagram eight of the implementation of the SLAM algorithm provided in the embodiment of the present application, and fig. 9 shows the principle of the training processing on the initial network model. The input image information is two adjacent frames: I(t) and I(t+1). The previous frame image I(t) of the two adjacent frames is input into the second sub-network, i.e. the depth prediction sub-network, and I(t) and I(t+1) are input into the first sub-network, i.e. the pose regression sub-network, whose feature extraction part may adopt a ResNet18 network. The second sub-network comprises an encoder and a decoder, and a GC module is arranged in the encoder of the second sub-network; the first sub-network is also provided with a global attention module, and after the feature extraction module of the PoseNet, the first sub-network further comprises a fully connected layer (fully connected layers, FC). A depth map D(t) is obtained from the second sub-network, and a transformation matrix T is obtained from the first sub-network. Further, the second loss function L_smooth is obtained according to the depth map D(t); at the same time, affine transformation processing (warping) is carried out according to the transformation matrix T and the depth map D(t) to obtain a predicted image $\hat{I}(t+1)$, and the first loss function L_photo is obtained by corresponding calculation based on $\hat{I}(t+1)$ and I(t+1). Thus, the loss function is determined according to L_smooth and L_photo, and the self-supervision training processing is completed based on the loss function.
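Putting the above together, one self-supervised training step might look like the following sketch; depth_net, pose_net, warp(), photometric_loss_ssim() and smoothness_loss() refer to the illustrative helpers sketched earlier (warp() is assumed here to be a dense, differentiable version of the per-pixel warping shown above), and the loss weights are placeholders.

```python
def training_step(depth_net, pose_net, I_t, I_t1, K, optimizer, alpha=1.0, beta=0.1):
    D_t = depth_net(I_t)                   # depth map predicted from the previous frame
    pose = pose_net(I_t, I_t1)             # 6-DoF pose, convertible to the matrix T
    I_t1_pred = warp(I_t, D_t, pose, K)    # predicted later frame via warping (assumed helper)

    # LOSS = alpha * L_photo + beta * L_smooth, as in formula (4)
    loss = alpha * photometric_loss_ssim(I_t1_pred, I_t1) + beta * smoothness_loss(D_t, I_t)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```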
Further, in the embodiment of the present application, the method for obtaining the target transformation matrix by performing the feature extraction processing on the target image information by the terminal based on the target network model, that is, the method proposed in step 103 may include the following steps:
and 103a, performing feature extraction processing on the target image information based on a target first sub-network in the target network model to obtain an initial transformation matrix.
In the embodiment of the application, the terminal performs feature extraction processing on the target image information based on the target network model to obtain a target transformation matrix, and in some embodiments of the application, the terminal may perform feature extraction processing on the target image information based on a target first sub-network in the target network model to obtain an initial transformation matrix.
It should be noted that, in the embodiment of the present application, the target first sub-network performs the feature extraction process based on the two adjacent frames of images in the target image information.
It may be appreciated that in the embodiment of the present application, the target first sub-network is a trained first sub-network; for example, the target first subnetwork may be a trained pose regression subnetwork.
In the embodiment of the present application, the initial transformation matrix refers to matrix information obtained after feature extraction processing is performed on the target image information by using the target first sub-network.
And 103b, performing feature extraction processing on a previous frame image in the target image information based on a target second sub-network in the target network model to obtain a target depth map corresponding to the previous frame image in the target image information.
In the embodiment of the application, the terminal performs feature extraction processing on the target image information based on the target network model to obtain the target transformation matrix, and in some embodiments of the application, the terminal may perform feature extraction processing on a previous frame image in the target image information based on a target second sub-network in the target network model to obtain a target depth map corresponding to the previous frame image in the target image information.
It should be noted that, in the embodiment of the present application, the target second sub-network is a trained second sub-network; for example, the target second subnetwork is a trained deep predictive subnetwork.
It should be noted that, in the embodiment of the present application, the target depth map is a depth map obtained by performing feature extraction processing on a previous frame image in the target image information by using the target second sub-network; the target depth map may characterize depth information of a previous frame of image in the target image information.
And 103c, optimizing the initial transformation matrix by using the target depth map to obtain a target transformation matrix.
In the embodiment of the application, after performing feature extraction processing on target image information based on a target first sub-network in a target network model to obtain an initial transformation matrix, and performing feature extraction processing on a previous frame image in the target image information based on a target second sub-network in the target network model to obtain a target depth map corresponding to the previous frame image in the target image information, the terminal may perform optimization processing on the initial transformation matrix by using the target depth map to obtain the target transformation matrix.
It should be noted that, in the embodiment of the present application, the target depth map may be used to perform optimization processing on the initial transformation matrix, so that the obtained target transformation matrix has higher accuracy, and thus, the relative transformation relationship between the adjacent frame images in the target image information is described more accurately.
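For illustration of how a target transformation matrix and a target depth map can then be used to build point cloud data, a minimal back-projection sketch is given below (NumPy, pinhole camera model); it is an illustrative example, not the implementation of the present application.

```python
import numpy as np

def backproject_to_point_cloud(depth, K, T=np.eye(4)):
    """Back-project a depth map into a 3D point cloud and move it into a
    common frame with the 4x4 transformation matrix T.

    Shapes: depth [H, W], K 3x3, T 4x4.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pixels = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(np.float64)

    # X_cam = D(p) * K^-1 * p for every pixel p
    points_cam = np.linalg.inv(K) @ pixels * depth.reshape(1, -1)

    # Apply the transformation to place the points in the common frame
    points_h = np.vstack([points_cam, np.ones((1, points_cam.shape[1]))])
    return (T @ points_h)[:3].T   # [H*W, 3] point cloud
```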
Further, in the embodiment of the present application, to evaluate the performance of the target network model, the present application may test the overall motion trajectory and depth values of the target network model using the test data set. Fig. 10 is a schematic diagram nine of an implementation of the SLAM algorithm provided in the embodiment of the present application; as shown in fig. 10, the test shown is the motion trajectory result obtained on the 09 sequence of the KITTI data set. Fig. 11 is a schematic diagram ten of an implementation of the SLAM algorithm according to the embodiment of the present application, and fig. 11 shows the motion trajectory result obtained on the 10 sequence of the KITTI data set; Ground Truth is the label, i.e., represents the real data, and Ours represents the data obtained using the target network model in the embodiment of the present application. Fig. 12 is a schematic diagram eleven of an implementation of the SLAM algorithm according to the embodiment of the present application, and fig. 12 shows the depth map obtained for one frame of image. Therefore, for challenging scenes containing dynamic objects, the present application designs and adds the global attention module, which can capture spatial global context information and weight the attention of the channels with low computational complexity, so that the target network model has stronger accuracy and robustness.
The embodiment of the application provides an SLAM algorithm, a terminal and a storage medium, wherein the terminal acquires a training data set; performing self-supervision training treatment according to the training data set and the initial network model to obtain a target network model; wherein the initial network model includes a global attention module; the global attention module is used for learning the importance degrees corresponding to different image blocks and the importance degrees corresponding to different channels; performing feature extraction processing on the target image information based on the target network model to obtain a target transformation matrix; and constructing point cloud data based on the target transformation matrix. Therefore, the global attention module is added into the structure of the initial network model, so that the global attention module can learn the importance degrees corresponding to different image blocks and the importance degrees corresponding to different channels, and the target network model obtained after training can automatically 'notice' the importance degrees of different image areas when the feature extraction processing is carried out, and the model accuracy is greatly improved; and then, feature extraction processing is carried out on target image information according to a target network model, and point cloud data is constructed based on the obtained target transformation matrix, so that the method has stronger robustness and accuracy.
Embodiment III
Based on the above embodiments, in another embodiment of the present application, fig. 13 is a schematic diagram of the composition structure of the terminal according to the embodiment of the present application. As shown in fig. 13, the terminal 10 according to the embodiment of the present application may include an acquiring unit 11, a training unit 12, an obtaining unit 13 and a constructing unit 14.
the acquiring unit 11 is configured to acquire a training data set.
The training unit 12 is configured to perform self-supervision training processing according to the training data set and the initial network model, so as to obtain a target network model; wherein the initial network model includes a global attention module; the global attention module is used for learning the importance degrees corresponding to different image blocks and the importance degrees corresponding to different channels.
The obtaining unit 13 is configured to perform feature extraction processing on the target image information based on the target network model, so as to obtain a target transformation matrix.
The construction unit 14 is configured to construct point cloud data based on the target transformation matrix.
Further, the global attention module is a global context attention GC module; the global attention module includes a first attention module and a second attention module.
Further, the obtaining unit 13 is further configured to perform global attention calculation on the target image information based on a target global attention module in the target network model, so as to complete the feature extraction processing.
Further, the obtaining unit 13 is further configured to obtain an attention weight through a first module in the target global attention module, and multiply the attention weight with the original input feature to obtain a global context feature; obtain a channel dependency relationship corresponding to the global context feature according to a second module in the target global attention module, where the second module comprises a bottleneck conversion module, a normalization layer and an activation layer; and perform feature fusion processing according to the channel dependency relationship and the target image information to complete the global attention calculation.
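A minimal sketch of such a global context attention calculation is given below (PyTorch); the reduction ratio and the additive fusion are assumptions, and the block is illustrative rather than the module of the present application.

```python
import torch
import torch.nn as nn

class GCBlock(nn.Module):
    """Global context attention block sketch along the lines of the description above."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)       # spatial attention weights
        self.transform = nn.Sequential(                          # bottleneck transform
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.LayerNorm([channels // reduction, 1, 1]),          # normalization layer
            nn.ReLU(inplace=True),                                # activation layer
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        # First module: attention weights over all positions, then a weighted
        # sum of the input features to obtain the global context feature
        weights = torch.softmax(self.attn(x).view(b, 1, h * w), dim=-1)    # [B, 1, HW]
        context = torch.bmm(x.view(b, c, h * w), weights.transpose(1, 2))  # [B, C, 1]
        context = context.view(b, c, 1, 1)

        # Second module: channel dependencies via the bottleneck transform
        channel_dep = self.transform(context)

        # Feature fusion with the original input (additive fusion assumed)
        return x + channel_dep
```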
Further, the initial network model includes a first subnetwork and a second subnetwork; the first subnetwork includes the first attention module; the second subnetwork includes the second attention module.
Further, the first sub-network is a pose regression sub-network; and the feature extraction module of the pose regression sub-network is a ResNet network.
Further, the second sub-network is a depth prediction sub-network; the depth prediction sub-network is a U-Net network; the U-Net network comprises an encoder and a decoder; the encoder is a ResNet network; the decoder is a DispNet network.
Further, the training unit 12 is further configured to obtain a depth map and a transformation matrix based on the adjacent frame images in the training dataset and the initial network model; the depth map characterizes the depth information of the previous frame image in the adjacent frame images; the transformation matrix characterizes the relative transformation relation between the adjacent frame images;
and obtaining a predicted image corresponding to a later frame image in the adjacent frame images according to the depth image and the transformation matrix, calculating a loss function value based on the predicted image, and performing self-supervision training treatment on the initial network model according to the loss function value to obtain the target network model.
Further, the training unit 12 is further configured to perform affine transformation processing according to camera internal parameter information, the depth map, and the transformation matrix, so as to obtain the predicted image.
Further, the training unit 12 is further configured to calculate a first loss function value based on the predicted image and a later frame image of the adjacent frame images; wherein the first loss function value characterizes a degree of difference between a subsequent one of the adjacent frame images and the predicted image; calculating a second loss function value based on the depth map; wherein the second loss function value characterizes a degree of change of the depth map; and determining the loss function value from the first loss function value and the second loss function value.
Further, the training unit 12 is further configured to calculate the first loss function value according to the predicted image, a later frame image of the adjacent frame images, and a structural similarity index SSIM.
Further, the training unit 12 is further configured to perform feature extraction processing on the adjacent frame image based on the first sub-network, to obtain the transformation matrix;
and carrying out feature extraction processing on the previous frame image based on the second sub-network to obtain the depth map.
Further, the ResNet network includes an input portion, an output portion, and an intermediate portion; wherein the intermediate portion is made up of stacked residual blocks.
Further, the obtaining unit 13 is further configured to perform feature extraction processing on the target image information based on a target first sub-network in the target network model, so as to obtain an initial transformation matrix; performing feature extraction processing on a previous frame image in the target image information based on a target second sub-network in the target network model to obtain a target depth map corresponding to the previous frame image in the target image information; and optimizing the initial transformation matrix by utilizing the target depth map to obtain the target transformation matrix.
Fig. 14 is a schematic diagram of a second component structure of the terminal according to the embodiment of the present application, as shown in fig. 14, the terminal according to the embodiment of the present application may further include a processor 15, a memory 16 storing instructions executable by the processor 15, and further, the terminal 20 may further include a communication interface 17, and a bus 18 for connecting the processor 15, the memory 16 and the communication interface 17.
In an embodiment of the present application, the processor 15 may be at least one of an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a digital signal processor (Digital Signal Processor, DSP), a digital signal processing device (Digital Signal Processing Device, DSPD), a programmable logic device (Programmable Logic Device, PLD), a field programmable gate array (Field Programmable Gate Array, FPGA), a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, and a microprocessor. It will be appreciated that the electronic device for implementing the above-mentioned processor function may be different for different apparatuses, and the embodiment of the present application is not specifically limited thereto. The terminal may further include a memory 16, which memory 16 may be connected to the processor 15, wherein the memory 16 is adapted to store executable program code comprising computer operating instructions; the memory 16 may comprise a high speed RAM memory, and may further comprise a non-volatile memory, e.g. at least two disk memories.
In the present embodiment, the bus 18 is used to connect the communication interface 17, the processor 15 and the memory 16, and to enable the intercommunication among these devices.
In an embodiment of the present application, memory 16 is used to store instructions and data.
Further, in the embodiment of the present application, the processor 15 is configured to acquire a training data set;
performing self-supervision training treatment according to the training data set and the initial network model to obtain a target network model; wherein the initial network model includes a global attention module; the global attention module is used for learning the importance degrees corresponding to different image blocks and the importance degrees corresponding to different channels;
performing feature extraction processing on the target image information based on the target network model to obtain a target transformation matrix;
and constructing point cloud data based on the target transformation matrix.
In practical applications, the Memory 16 may be a volatile Memory (RAM), such as a Random-Access Memory (RAM); or a nonvolatile Memory (non-volatile Memory), such as a Read-Only Memory (ROM), a flash Memory (flash Memory), a Hard Disk (HDD) or a Solid State Drive (SSD); or a combination of memories of the above kind and providing instructions and data to the processor 15.
In addition, each functional module in the present embodiment may be integrated in one analysis unit, each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional modules.
The integrated units, if implemented in the form of software functional modules, may be stored in a computer-readable storage medium, if not sold or used as separate products, and based on this understanding, the technical solution of the present embodiment may be embodied essentially or partly in the form of a software product, which is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) or processor (processor) to perform all or part of the steps of the method of the present embodiment. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The embodiment of the application provides a terminal and a storage medium, wherein the terminal acquires a training data set; performing self-supervision training treatment according to the training data set and the initial network model to obtain a target network model; wherein the initial network model includes a global attention module; the global attention module is used for learning the importance degrees corresponding to different image blocks and the importance degrees corresponding to different channels; performing feature extraction processing on the target image information based on the target network model to obtain a target transformation matrix; and constructing point cloud data based on the target transformation matrix. Therefore, the global attention module is added into the structure of the initial network model, so that the global attention module can learn the importance degrees corresponding to different image blocks and the importance degrees corresponding to different channels, and the target network model obtained after training can automatically 'notice' the importance degrees of different image areas when the feature extraction processing is carried out, and the model accuracy is greatly improved; and then, feature extraction processing is carried out on the target image information according to the target network model, and point cloud data is constructed based on the obtained target transformation matrix, so that the method has stronger robustness and accuracy.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage and optical storage, etc.) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of implementations of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block and/or flow of the flowchart illustrations and/or block diagrams, and combinations of blocks and/or flow diagrams in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart block or blocks and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart block or blocks and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart block or blocks and/or block diagram block or blocks.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the present application.

Claims (17)

1. A synchronous positioning and mapping SLAM algorithm, the method comprising:
acquiring a training data set;
performing self-supervision training treatment according to the training data set and the initial network model to obtain a target network model; wherein the initial network model includes a global attention module; the global attention module is used for learning the importance degrees corresponding to different image blocks and the importance degrees corresponding to different channels;
Performing feature extraction processing on the target image information based on the target network model to obtain a target transformation matrix;
and constructing point cloud data based on the target transformation matrix.
2. The method of claim 1, wherein the global attention module is a global context attention GC module; the global attention module includes a first attention module and a second attention module.
3. The method according to claim 2, wherein the feature extraction processing of the target image information based on the target network model includes:
and performing global attention calculation on the target image information based on a target global attention module in the target network model so as to complete the feature extraction processing.
4. The method of claim 3, wherein the global attention calculation of the target image information based on a target global attention module in the target network model comprises:
obtaining attention weight through a first module in the target global attention module, and multiplying the attention weight with original input characteristics to obtain global context characteristics;
Obtaining a channel dependency relationship corresponding to the global context characteristic according to a second module in the target global attention module; the second module comprises a bottleneck conversion module, a normalization layer and an activation layer;
and carrying out feature fusion processing on the target image information according to the channel dependency relationship so as to complete the global attention calculation.
5. The method of claim 2, wherein the initial network model comprises a first subnetwork and a second subnetwork; the first subnetwork includes the first attention module; the second subnetwork includes the second attention module.
6. The method of claim 2 or 5, wherein the first subnetwork is a pose regression subnetwork; and the feature extraction module of the pose regression sub-network is a ResNet network.
7. The method according to claim 2 or 5, wherein the second subnetwork is a depth prediction subnetwork; the depth prediction sub-network is a U-Net network; the U-Net network comprises an encoder and a decoder; the encoder is a ResNet network; the decoder is a DispNet network.
8. The method of claim 1, wherein the performing a self-supervised training process based on the training dataset and an initial network model to obtain a target network model comprises:
Obtaining a depth map and a transformation matrix based on adjacent frame images in the training dataset and the initial network model; the depth map characterizes the depth information of the previous frame image in the adjacent frame images; the transformation matrix characterizes the relative transformation relation between the adjacent frame images;
and obtaining a predicted image corresponding to a later frame image in the adjacent frame images according to the depth image and the transformation matrix, calculating a loss function value based on the predicted image, and performing self-supervision training treatment on the initial network model according to the loss function value to obtain the target network model.
9. The method according to claim 8, wherein obtaining a predicted image corresponding to a subsequent frame image from among the neighboring frame images according to the depth map and the transformation matrix, comprises:
and carrying out affine transformation processing according to the camera internal reference information, the depth map and the transformation matrix to obtain the predicted image.
10. The method according to claim 8 or 9, wherein said calculating a loss function value based on said predicted image comprises:
calculating a first loss function value based on the predicted image and a subsequent frame image of the adjacent frame images; wherein the first loss function value characterizes a degree of difference between a subsequent one of the adjacent frame images and the predicted image;
Calculating a second loss function value based on the depth map; wherein the second loss function value characterizes a degree of variation of the depth map;
determining the loss function value from the first loss function value and the second loss function value.
11. The method of claim 10, wherein the calculating a first loss function value based on the predicted image and a subsequent one of the neighboring frame images comprises:
and calculating the first loss function value according to the predicted image, a later frame image in the adjacent frame images and a structural similarity index SSIM.
12. The method according to claim 5 or 8, wherein said obtaining a depth map and a transformation matrix based on adjacent frame images in the training dataset and the initial network model comprises:
performing feature extraction processing on the adjacent frame images based on the first sub-network to obtain the transformation matrix;
and carrying out feature extraction processing on the previous frame image based on the second sub-network to obtain the depth map.
13. The method according to claim 6 or 7, wherein the ResNet network comprises an input portion, an output portion, and an intermediate portion; wherein the intermediate portion is made up of stacked residual blocks.
14. The method according to claim 1, wherein the performing feature extraction processing on the target image information based on the target network model to obtain a target transformation matrix includes:
performing feature extraction processing on the target image information based on a target first sub-network in the target network model to obtain an initial transformation matrix;
performing feature extraction processing on a previous frame image in the target image information based on a target second sub-network in the target network model to obtain a target depth map corresponding to the previous frame image in the target image information;
and optimizing the initial transformation matrix by using the target depth map to obtain the target transformation matrix.
15. A terminal, characterized by comprising an acquisition unit, a training unit, an obtaining unit and a construction unit,
the acquisition unit is used for acquiring a training data set;
the training unit is used for performing self-supervision training processing according to the training data set and the initial network model to obtain a target network model; wherein the initial network model includes a global attention module; the global attention module is used for learning the importance degrees corresponding to different image blocks and the importance degrees corresponding to different channels;
The obtaining unit is used for carrying out feature extraction processing on the target image information based on the target network model to obtain a target transformation matrix;
the construction unit is used for constructing point cloud data based on the target transformation matrix.
16. A terminal comprising a processor, a memory storing instructions executable by the processor, which when executed by the processor, implement the method of any one of claims 1-14.
17. A computer storage medium having stored thereon a program which, when executed by a processor, implements the method of any of claims 1-14.
CN202210102695.9A 2022-01-27 2022-01-27 Synchronous positioning and mapping SLAM algorithm, terminal and storage medium Pending CN116563478A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210102695.9A CN116563478A (en) 2022-01-27 2022-01-27 Synchronous positioning and mapping SLAM algorithm, terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210102695.9A CN116563478A (en) 2022-01-27 2022-01-27 Synchronous positioning and mapping SLAM algorithm, terminal and storage medium

Publications (1)

Publication Number Publication Date
CN116563478A true CN116563478A (en) 2023-08-08

Family

ID=87496996

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210102695.9A Pending CN116563478A (en) 2022-01-27 2022-01-27 Synchronous positioning and mapping SLAM algorithm, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN116563478A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117437595A (en) * 2023-11-27 2024-01-23 哈尔滨航天恒星数据系统科技有限公司 Fishing boat boundary crossing early warning method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination