CN116817887B - Semantic visual SLAM map construction method, electronic equipment and storage medium - Google Patents

Semantic visual SLAM map construction method, electronic equipment and storage medium Download PDF

Info

Publication number
CN116817887B
CN116817887B CN202310773986.5A
Authority
CN
China
Prior art keywords
camera
image
sub
map
coordinate system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310773986.5A
Other languages
Chinese (zh)
Other versions
CN116817887A (en)
Inventor
万鲁河
齐少群
王志伟
卢廷玉
车丽娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Normal University
Original Assignee
Harbin Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Normal University filed Critical Harbin Normal University
Priority to CN202310773986.5A priority Critical patent/CN116817887B/en
Publication of CN116817887A publication Critical patent/CN116817887A/en
Application granted granted Critical
Publication of CN116817887B publication Critical patent/CN116817887B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

A semantic visual SLAM map construction method, an electronic device and a storage medium belong to the technical field of computer vision and aim to improve the accuracy of the camera pose and the completeness of the camera trajectory. The invention calibrates the internal and external parameters of the cameras and then acquires sub-map images; performs inverse perspective transformation on the sub-map images acquired in step S1, and then fuses and stitches them to obtain an all-around sub-map image; inputs the all-around sub-map image obtained in step S2 into a neural network for feature extraction and then performs feature projection to obtain sub-map image coordinates in the world coordinate system; and uses the sub-map image coordinates in the world coordinate system obtained in step S3 for local mapping, loop-closure detection and global optimization to obtain the semantic visual SLAM map. After the back-end optimization algorithm and the IMU sensor are added, it can clearly be seen that there is no obvious trajectory drift between the algorithm result and the ground truth.

Description

Semantic visual SLAM map construction method, electronic equipment and storage medium
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a semantic vision SLAM map construction method, electronic equipment and a storage medium.
Background
In the current SLAM field, the three-dimensional LiDAR + GNSS scheme is costly, and localization can be lost in some special situations (radar point-cloud drift at turns and GPS signal loss among dense building clusters). The camera sensors used by visual semantic SLAM are low-cost, can acquire rich scene semantic information and have the potential to handle these special situations. However, most current semantic SLAM research targets static scenes: it must assume the scene is static and is therefore unsuitable for dynamic scenes, it requires a geometric model of the object and an object database so that only objects already in the database can be detected, and it can only handle the tracking and dense reconstruction of moving objects at the single-digit level. Most existing visual SLAM systems are optimized on the basis of a VI-SLAM system and generally suffer from a low recall rate in dynamic scenes due to the lack of semantic information. For image object detection and segmentation, deeplabv3+ stands out: it uses ASPP to perform multi-scale object segmentation, but it cannot capture the context of global features and lacks the ability to extract high-level semantic information through multiple channels. It is therefore necessary to combine the leading-edge techniques of the three research fields of semantic SLAM, visual SLAM, and image object detection and segmentation to explore an efficient SLAM scheme.
Disclosure of Invention
The invention aims to improve the accuracy of the camera pose and the completeness of the camera trajectory, and provides a semantic visual SLAM map construction method, an electronic device and a storage medium.
In order to achieve the above purpose, the present invention is realized by the following technical scheme:
a semantic visual SLAM map construction method comprises the following steps:
s1, calibrating internal and external parameters of a camera, and then collecting sub map images;
s2, carrying out inverse perspective transformation on the sub-map image acquired in the step S1, and then carrying out sub-map image fusion and splicing to obtain an all-around sub-map image;
s3, inputting the all-around sub-map image obtained in the step S2 into a neural network for feature extraction, and then performing feature projection to obtain sub-map image coordinates under a world coordinate system;
and S4, carrying out local map building by utilizing the sub-map image coordinates under the world coordinate system obtained in the step S3, detecting loop, and carrying out global optimization to obtain the semantic visual SLAM map.
Further, the specific implementation method for calibrating the internal and external parameters of the camera in the step S1 comprises the following steps:
s1.1, manufacturing a black-and-white checkerboard, attaching the black-and-white checkerboard to a plane to serve as a calibration object, and shooting the black-and-white checkerboard with a camera at different angles, wherein 20 black-and-white checkerboard images are shot;
s1.2, marking corner points of a black-and-white checkerboard from the black-and-white checkerboard image shot in the step S1.1;
s1.3, solving the parameters in the camera and the parameters outside the camera corresponding to each image shot in the step S1.1;
firstly, establishing a camera imaging model as follows:
wherein u and v are the horizontal and vertical coordinates of the object projected into the image coordinate system, s is a scale factor, K is the camera intrinsic parameter matrix, r_1 and r_2 are the first and second column vectors of the rotation matrix, t is the translation vector, and X_W and Y_W are the x-axis and y-axis coordinates of the object in the real-space coordinate system;
setting H as a homography matrix, and converting a camera imaging model into:
H is a 3×3 matrix, and the calculation expression of H is:
H = A[r_1 r_2 t]
wherein A is the camera intrinsic parameter matrix;
setting R as a rotation matrix, and obtaining the following formula:
S1.4, estimating the camera intrinsic parameters and extrinsic parameters obtained in step S1.3, together with the radial distortion coefficient, by the least-squares method;
and S1.5, optimizing the camera inner parameters and the camera outer parameters obtained in the step S1.4 based on a maximum likelihood method to obtain optimized camera inner parameters and optimized camera outer parameter coefficients.
Further, in step S1, the number of cameras is 6, and the fields of view of the 6 cameras cover 360 degrees.
Further, the specific implementation method of the step S2 includes the following steps:
s2.1, establishing forward projection of intersection points of an x axis and a y axis, which are overlapped with an optical center of a camera, on the ground, establishing a dynamic inverse perspective transformation equation when a pitching angle exists in the camera, and carrying out inverse perspective transformation on the acquired sub-map image, wherein the calculation expression is as follows:
wherein x(u, v) is the horizontal road-surface coordinate of the inverse-perspective top view, y(u, v) is the vertical road-surface coordinate of the inverse-perspective top view, u and v are the horizontal and vertical coordinates of the sub-map perspective image in the image coordinate system, Δz is the height of the camera above the ground, α is the heading (yaw) angle of the camera, β is the roll angle of the camera, and θ is the pitch angle of the camera;
s2.2, carrying out sub-map image fusion and splicing on the sub-map image subjected to the inverse perspective transformation obtained in the step S2.1 to obtain an all-around sub-map image, wherein the calculation expression is as follows:
wherein the back-projection term converts a pixel into a spatial point, [R_c, t_c] are the extrinsic parameters of each camera, [u v] is the pixel position in the image coordinate system, x_v and y_v are the x-axis and y-axis coordinates of the feature point in the vehicle coordinate system, λ is a scalar, and col denotes a column of the matrix.
Further, the specific implementation method of the step S3 includes the following steps:
s3.1, inputting the circular sub-map image obtained in the step S2 into a deeplabv3+ neural network model for semantic information feature extraction, and carrying out feature projection after obtaining feature points, wherein the calculation expression is as follows:
wherein K_ipm is the intrinsic parameter matrix of the look-around sub-map image, and [u_ipm, v_ipm] are the pixel coordinates on the look-around sub-map image;
s3.2, performing feature projection after obtaining feature points, and projecting features from an image coordinate system to a 3D coordinate system, wherein the calculation expression is as follows:
converting image features from a vehicle coordinate system to a world coordinate system by an odometer, and calculating the expression as follows:
wherein [R_o, t_o] is the pose value of the odometer.
Further, the specific implementation method of the step S4 includes the following steps:
S4.1, carrying out local mapping on the feature points of the feature projection obtained in step S3.2, wherein a frame image is divided into two parts, the left block diagram being the local image and the right block diagram being the global image; the local map is constructed from a series of iterative scans and contains the scan-position information and the sub-map information, so as to obtain a local sub-map;
s4.2, performing element point matching on the local sub-map obtained in the step S4.1 by using an ICP point cloud matching algorithm, initializing by using an IMU if the frame is the first frame, then performing IMU pre-integration, and then inserting a relative posture memory of a camera into the memory to perform loop detection and global optimization, wherein the calculation expression of the loop detection is as follows:
wherein the first term denotes the pose of the local sub-map, i being any one of the m sub-maps, the second term denotes the pose of the camera, j being any one of the n camera frames, ξ_ij is the pose of the feature point in the local sub-map coordinate system, Σ_ij is the corresponding covariance matrix, E is the residual, and ρ is the optimization coefficient;
the calculation formula of the residual E is as follows:
the pose of the local sub-map and the pose of the camera are obtained at the front end, and ξ_ij is obtained by matching with the loop-closure camera;
for global optimization, the pose R, t is calculated by PnP from the 3D coordinates of the co-visible points and the camera intrinsic matrix; the projection of the feature points onto the second frame image is calculated from R, t and the projection model of the camera; the pose is optimized through the constructed projection function, and the camera parameters and the coordinates of the three-dimensional space points are optimized by minimizing the reprojection error, i.e. the difference between the projection of the real three-dimensional space points onto the image plane and their re-projection;
then converting the image features from the vehicle coordinate system to the world coordinate system by the odometer, where the calculation expression is as follows:
wherein [R_o, t_o] is the pose value of the odometer;
and S4.3, using the loop detection and global optimization of step S4.2 as prior information of the sliding window to optimize the sliding window, then optimizing the IMU pre-integration with the factor graph to generate the IMU factors and visual odometry factors of the pose, obtaining the final pose, and generating the trajectory graph.
Further, the loop detection and global optimization process of step S4.2 constructs a bag-of-words model, and the specific implementation method includes the following steps:
s4.2.1 constructing a dictionary by using a K-means method, creating a dictionary on all the attributes of the feature points, setting all the nodes as K types at the root node position, creating a first layer cluster of all the feature points according to the K types, and then clustering a second layer and each subsequent layer to obtain leaf nodes, forming words, and obtaining a K-ary tree model;
S4.2.2, dividing all objects into d layers based on the k-ary tree model constructed in step S4.2.1, wherein each layer has k nodes, so that at most k^d words can be contained; when an image is input, its feature points are compared with the clusters of each layer, and finally all the points of the image are classified into words;
S4.2.3, correcting the weight of the object points in the image by the TF-IDF method: for the i-th feature point in picture A, look up the corresponding leaf-node word in the dictionary and denote it as a; the number of times this word appears in picture A is n_i, the total number of occurrences of all feature points in picture A is n_a, the number of words in the dictionary is n, and the weight of the word is η_i, where the relational expression is as follows:
η_i = TF_i × IDF_i
wherein IDF_i is the inverse document frequency index and TF_i is the frequency with which the word appears in the current frame image;
S4.2.4, when the dictionaries obtained in step S4.2.3 have words in common, the picture descriptions are expressed as vectors of the same form and the similarity of the pictures is calculated, where the calculation formula is as follows:
wherein b is the comparison picture and s(a, b) is the picture similarity;
the electronic equipment comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the semantic visual SLAM map construction method when executing the computer program.
A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the semantic visual SLAM map construction method.
The invention has the beneficial effects that:
the semantic vision SLAM map construction method has the advantages of high robustness and recall rate, and higher precision and overall stability than a laser radar+GNSS SLAM scheme under certain special conditions (turning position radar point cloud drifting and GPS signal losing under the condition of dense building groups). In some cornering road situations and in some dynamic scenarios, semantic vision SLAM is more advantageous. On the basis of the original laser radar and GNSS, the semantic vision SLAM is fused, and better auxiliary effects can be achieved in the aspects of map building and positioning.
Drawings
FIG. 1 is a flow chart of a semantic visual SLAM map building method according to the present invention;
FIG. 2 is a schematic diagram showing the influence of yaw and pitch angles on inverse perspective transformation in a semantic visual SLAM map construction method according to the present invention;
FIG. 3 is a schematic view of a visual odometer according to the present invention for a semantic visual SLAM map construction method;
FIG. 4 is a K-ary tree model of a semantic visual SLAM map building method according to the present invention;
FIG. 5 is a graph of the trajectory error of the semantic visual SLAM relative to the ground truth for the semantic visual SLAM map construction method according to the present invention;
FIG. 6 is a graph of the position error of the semantic visual SLAM relative to the ground truth for the semantic visual SLAM map construction method according to the present invention;
FIG. 7 is a graph of the trajectory error of the visual-inertial SLAM for the semantic visual SLAM map construction method according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings and detailed description. It should be understood that the embodiments described herein are for purposes of illustration only and are not intended to limit the invention, i.e., the embodiments described are merely some, but not all, of the embodiments of the invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein can be arranged and designed in a wide variety of different configurations, and the present invention can have other embodiments as well.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, are intended to fall within the scope of the present invention.
For further understanding of the invention, the following detailed description is to be taken in conjunction with the accompanying drawings 1-7, in which:
the first embodiment is as follows:
a semantic visual SLAM map construction method comprises the following steps:
s1, calibrating internal and external parameters of a camera, and then collecting sub map images;
further, the specific implementation method for calibrating the internal and external parameters of the camera in the step S1 comprises the following steps:
s1.1, manufacturing a black-and-white checkerboard, attaching the black-and-white checkerboard to a plane to serve as a calibration object, and shooting the black-and-white checkerboard with a camera at different angles, wherein 20 black-and-white checkerboard images are shot;
s1.2, marking corner points of a black-and-white checkerboard from the black-and-white checkerboard image shot in the step S1.1;
s1.3, solving the parameters in the camera and the parameters outside the camera corresponding to each image shot in the step S1.1;
firstly, establishing a camera imaging model as follows:
wherein u and v are the horizontal and vertical coordinates of the object projected into the image coordinate system, s is a scale factor, K is the camera intrinsic parameter matrix, r_1 and r_2 are the first and second column vectors of the rotation matrix, t is the translation vector, and X_W and Y_W are the x-axis and y-axis coordinates of the object in the real-space coordinate system;
setting H as a homography matrix, and converting a camera imaging model into:
H is a 3×3 matrix, and the calculation expression of H is:
H = A[r_1 r_2 t]
wherein A is the camera intrinsic parameter matrix;
H is a 3×3 matrix defined up to scale in homogeneous coordinates, so it has 8 unknown elements; each pair of corresponding points gives two equations, so at least four pairs of corresponding points are needed to compute the homography matrix H;
the matrix A contains 5 elements, so a unique closed-form solution for A requires 3 sets of H. The effect of camera distortion is ignored in the above calculation; since radial distortion exists in practice (tangential distortion is neglected), the internal and external parameters are estimated together with the distortion coefficient by the least-squares method and finally refined by the maximum-likelihood method to obtain a more accurate solution;
setting R as a rotation matrix, and obtaining the following formula:
S1.4, estimating the camera intrinsic parameters and extrinsic parameters obtained in step S1.3, together with the radial distortion coefficient, by the least-squares method;
s1.5, optimizing the camera inner parameters and the camera outer parameters obtained in the step S1.4 based on a maximum likelihood method to obtain optimized camera inner parameters and optimized camera outer parameter coefficients;
further, in step S1, the number of cameras is 6, and the fields of view of the 6 cameras cover 360 degrees;
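The calibration procedure of steps S1.1 to S1.5 can be sketched, as one possible realization, with OpenCV's implementation of Zhang's method; the board size, square size and image file names below are assumptions, and the closed-form solution, radial-distortion estimation and maximum-likelihood refinement are all performed inside cv2.calibrateCamera rather than being spelled out separately:
# Hedged sketch of the checkerboard calibration (steps S1.1-S1.5); pattern, square
# size and file names are assumed example values, not values given in the text.
import glob
import cv2
import numpy as np

pattern = (9, 6)          # inner corners per row/column (assumed)
square = 0.025            # edge length of one square in metres (assumed)

# 3D corner positions on the board plane (Z = 0 in the board coordinate system)
obj = np.zeros((pattern[0] * pattern[1], 3), np.float32)
obj[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_pts, img_pts = [], []
for path in glob.glob("checkerboard_*.png"):      # the ~20 images of step S1.1
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    ok, corners = cv2.findChessboardCorners(gray, pattern)   # step S1.2: corner marking
    if not ok:
        continue
    corners = cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1),
                               (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
    obj_pts.append(obj)
    img_pts.append(corners)

# Steps S1.3-S1.5: closed-form solution, radial-distortion fit and
# maximum-likelihood (reprojection-error) refinement inside calibrateCamera.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(obj_pts, img_pts, gray.shape[::-1], None, None)
print("intrinsics K:\n", K, "\ndistortion:", dist.ravel(), "\nRMS reprojection error:", rms)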
s2, carrying out inverse perspective transformation on the sub-map image acquired in the step S1, and then carrying out sub-map image fusion and splicing to obtain an all-around sub-map image;
further, the specific implementation method of the step S2 includes the following steps:
s2.1, establishing forward projection of intersection points of an x axis and a y axis, which are overlapped with an optical center of a camera, on the ground, establishing a dynamic inverse perspective transformation equation when a pitching angle exists in the camera, and carrying out inverse perspective transformation on the acquired sub-map image, wherein the calculation expression is as follows:
wherein x(u, v) is the horizontal road-surface coordinate of the inverse-perspective top view, y(u, v) is the vertical road-surface coordinate of the inverse-perspective top view, u and v are the horizontal and vertical coordinates of the sub-map perspective image in the image coordinate system, Δz is the height of the camera above the ground, α is the heading (yaw) angle of the camera, β is the roll angle of the camera, and θ is the pitch angle of the camera;
s2.2, carrying out sub-map image fusion and splicing on the sub-map image subjected to the inverse perspective transformation obtained in the step S2.1 to obtain an all-around sub-map image, wherein the calculation expression is as follows:
wherein the back-projection term converts a pixel into a spatial point, [R_c, t_c] are the extrinsic parameters of each camera, [u v] is the pixel position in the image coordinate system, x_v and y_v are the x-axis and y-axis coordinates of the feature point in the vehicle coordinate system, λ is a scalar, and col denotes a column of the matrix;
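As an illustration of steps S2.1 and S2.2 (not the patent's exact equations), the sketch below builds, for one camera, the homography from the ground plane of the vehicle frame to image pixels and resamples that camera's image into a metric top view; K, R_c, t_c, the resolution, extent and the blending rule are assumed example values:
# A minimal sketch of inverse perspective transformation and top-view stitching,
# assuming the ground is flat and the extrinsics [R_c, t_c] (vehicle -> camera)
# are known from the calibration of step S1.
import cv2
import numpy as np

def ipm_homography(K, R_c, t_c):
    """Homography mapping a ground-plane point (x_v, y_v, 1) in the vehicle frame
    to pixels: for [x, y, 0], pixel ~ K (x * r1 + y * r2 + t_c) = K [r1 r2 t_c] [x, y, 1]^T."""
    return K @ np.column_stack((R_c[:, 0], R_c[:, 1], t_c))

def warp_to_topview(img, K, R_c, t_c, scale=50, size=(800, 800)):
    """Resample one camera image onto a metric top view (scale pixels per metre)."""
    H_ground_to_px = ipm_homography(K, R_c, t_c)
    # top-view pixel (u, v) -> vehicle-frame metres, centred on the vehicle
    T = np.array([[1 / scale, 0, -size[0] / (2 * scale)],
                  [0, 1 / scale, -size[1] / (2 * scale)],
                  [0, 0, 1]])
    # M maps destination (top-view) pixels to source pixels, hence WARP_INVERSE_MAP
    return cv2.warpPerspective(img, H_ground_to_px @ T, size,
                               flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)

# Fusion/stitching of step S2.2: a simple per-pixel maximum over the six warped
# views; a real system would more likely use distance-based alpha blending.
# surround = np.max(np.stack([warp_to_topview(im, K, R, t) for im, (R, t) in views]), axis=0)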
s3, inputting the all-around sub-map image obtained in the step S2 into a neural network for feature extraction, and then performing feature projection to obtain sub-map image coordinates under a world coordinate system;
further, the specific implementation method of the step S3 includes the following steps:
s3.1, inputting the circular sub-map image obtained in the step S2 into a deeplabv3+ neural network model for semantic information feature extraction, and carrying out feature projection after obtaining feature points, wherein the calculation expression is as follows:
wherein K_ipm is the intrinsic parameter matrix of the look-around sub-map image, and [u_ipm, v_ipm] are the pixel coordinates on the look-around sub-map image;
s3.2, performing feature projection after obtaining feature points, and projecting features from an image coordinate system to a 3D coordinate system, wherein the calculation expression is as follows:
converting image features from vehicle coordinates to a world coordinate system by an odometer, and calculating an expression as follows:
wherein [R_o, t_o] is the pose value of the odometer;
further, training and learning are carried out by using a deeplabv3+ neural network model dataset, a special dataset for surface information in a closed park is manufactured, the dataset has 5600 images in total and 5 categories, namely a road boundary line, a guide line, a deceleration strip, a zebra crossing and a guide word, on the basis of the deeplabv3+, a double-attention mechanism module is connected in parallel in a pyramid pool of a cavity space, the parallel connection between DAMM and ASPP is carried out firstly by using a backbone network, then the model is split into two branch networks, the feature images extracted from the backbone network are respectively processed, then the two branch feature images are combined, and the size of the cavity convolution is adjusted according to the special condition of the dataset, so that the model is more suitable for a specific task. The self-made data set is used as a training verification data set of the optimized neural network model, in order to enable the network to be converged and stable rapidly, the hundred-degree open-source lane line data set is used for pre-training, and then the pre-training model is used for carrying out migration training on the original deeplabv3+ and the network added with the self-attention mechanism, and the network with the cavity convolution size changed. Experiments prove that the neural network model with the changed cavity convolution size and the self-attention mechanism has higher semantic segmentation precision, better effect on edge characteristics and finally adopts the trained neural network model weight to extract the characteristics.
S4, carrying out local map building by utilizing the sub-map image coordinates under the world coordinate system obtained in the step S3, loop detection and global optimization to obtain a semantic visual SLAM map;
further, the specific implementation method of the step S4 includes the following steps:
S4.1, carrying out local mapping on the feature points of the feature projection obtained in step S3.2, wherein a frame image is divided into two parts, the left block diagram being the local image and the right block diagram being the global image; the local map is constructed from a series of iterative scans and contains the scan-position information and the sub-map information, so as to obtain a local sub-map;
s4.2, performing element point matching on the local sub-map obtained in the step S4.1 by using an ICP point cloud matching algorithm, initializing by using an IMU if the frame is the first frame, then performing IMU pre-integration, and then inserting a relative posture memory of a camera into the memory to perform loop detection and global optimization, wherein the calculation expression of the loop detection is as follows:
wherein the first term denotes the pose of the local sub-map, i being any one of the m sub-maps, the second term denotes the pose of the camera, j being any one of the n camera frames, ξ_ij is the pose of the feature point in the local sub-map coordinate system, Σ_ij is the corresponding covariance matrix, E is the residual, and ρ is the optimization coefficient;
the calculation formula of the residual E is as follows:
the pose of the local sub-map and the pose of the camera are obtained at the front end, and ξ_ij is obtained by matching with the loop-closure camera;
global optimization calculates the pose R, t by PnP from the 3D coordinates of the co-visible points and the camera intrinsic matrix; the projection of the feature points onto the second frame image is calculated from R, t and the projection model of the camera; the pose is optimized through the constructed projection function, and the camera parameters and the coordinates of the three-dimensional space points are optimized by minimizing the reprojection error, i.e. the difference between the projection of the real three-dimensional space points onto the image plane and their re-projection;
then converting the image features from the vehicle coordinate system to the world coordinate system by the odometer, where the calculation expression is as follows:
wherein [R_o, t_o] is the pose value of the odometer;
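The PnP and reprojection-error step described above can be sketched with the standard OpenCV calls cv2.solvePnP and cv2.projectPoints; the function name estimate_pose_pnp, the data and the choice of the iterative solver are assumptions for illustration only:
# Hedged sketch: pose from co-visible 3D points by PnP, then the mean reprojection
# error that a bundle-adjustment step would minimise.
import cv2
import numpy as np

def estimate_pose_pnp(pts3d, pts2d, K, dist=None):
    """pts3d: (N, 3) map points; pts2d: (N, 2) their observations in the second frame."""
    ok, rvec, tvec = cv2.solvePnP(pts3d.astype(np.float32),
                                  pts2d.astype(np.float32), K, dist,
                                  flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        raise RuntimeError("PnP failed")
    # reprojection error = difference between projected map points and measurements
    proj, _ = cv2.projectPoints(pts3d.astype(np.float32), rvec, tvec, K, dist)
    err = np.linalg.norm(proj.reshape(-1, 2) - pts2d, axis=1)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec, err.mean()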
further, the loop detection and global optimization process of step S4.2 constructs a bag-of-words model, and the specific implementation method includes the following steps:
s4.2.1 constructing a dictionary by using a K-means method, creating a dictionary on all the attributes of the feature points, setting all the nodes as K types at the root node position, creating a first layer cluster of all the feature points according to the K types, and then clustering a second layer and each subsequent layer to obtain leaf nodes, forming words, and obtaining a K-ary tree model;
S4.2.2, dividing all objects into d layers based on the k-ary tree model constructed in step S4.2.1, wherein each layer has k nodes, so that at most k^d words can be contained; when an image is input, its feature points are compared with the clusters of each layer, and finally all the points of the image are classified into words;
S4.2.3, correcting the weight of the object points in the image by the TF-IDF method: for the i-th feature point in picture A, look up the corresponding leaf-node word in the dictionary and denote it as a; the number of times this word appears in picture A is n_i, the total number of occurrences of all feature points in picture A is n_a, the number of words in the dictionary is n, and the weight of the word is η_i, where the relational expression is as follows:
η_i = TF_i × IDF_i
wherein IDF_i is the inverse document frequency index and TF_i is the frequency with which the word appears in the current frame image;
S4.2.4, when the dictionaries obtained in step S4.2.3 have words in common, the picture descriptions are expressed as vectors of the same form and the similarity of the pictures is calculated, where the calculation formula is as follows:
wherein b is the comparison picture and s(a, b) is the picture similarity;
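A small sketch of the TF-IDF weighting of step S4.2.3 and an image-similarity score for step S4.2.4 follows; since the similarity formula itself is not reproduced in the text, the L1 score below follows the common DBoW-style form and is an assumption, as are the function names:
# Hedged sketch: bag-of-words vector with eta_i = TF_i * IDF_i weights, plus a
# normalised L1 similarity score between two such vectors.
import numpy as np
from collections import Counter

def bow_vector(word_ids, idf):
    """word_ids: dictionary word index of every feature point in one image;
    idf: per-word inverse document frequency learned when the dictionary was built."""
    counts = Counter(word_ids)
    n_a = sum(counts.values())            # total occurrences of all feature points
    vec = np.zeros(len(idf))
    for w, n_i in counts.items():
        tf = n_i / n_a                    # TF_i: frequency of the word in this frame
        vec[w] = tf * idf[w]              # eta_i = TF_i * IDF_i
    return vec

def similarity(va, vb):
    """Score in [0, 1]; 1 means identical word distributions (assumed L1 form)."""
    va = va / (np.linalg.norm(va, 1) + 1e-12)
    vb = vb / (np.linalg.norm(vb, 1) + 1e-12)
    return 1.0 - 0.5 * np.linalg.norm(va - vb, 1)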
and S4.3, using the loop detection and global optimization of step S4.2 as prior information of the sliding window to optimize the sliding window, then optimizing the IMU pre-integration with the factor graph to generate the IMU factors and visual odometry factors of the pose, obtaining the final pose, and generating the trajectory graph.
The fused surround-view image is input into the neural network model for feature extraction and matched with the ICP point-cloud matching algorithm; if the frame is the first frame, it is initialized with the IMU. After the frame pose is obtained, a fixed-size sliding window is maintained, and the sliding window is optimized with the IMU pre-integration, loop-closure constraints and other terms as its prior information. For large-scale localization, only a certain number of frames are kept in the sliding window for optimization, which avoids the burden of too many poses and map points when the mapped area becomes large; finally, the pose output by the sliding window is used as the visual-inertial odometry factor.
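The sliding-window / factor-graph idea can be illustrated with a toy 2D pose graph: odometry (or IMU pre-integration) factors chain consecutive keyframes and a loop-closure factor ties the newest pose back to an earlier one, with a Huber robust loss playing the role of the kernel ρ; the measurements, the 2D state and the scipy-based solver are illustrative assumptions, not the system actually used in this embodiment:
# Hedged toy sketch of pose-graph optimisation over a few keyframes [x, y, yaw].
import numpy as np
from scipy.optimize import least_squares

def wrap(a):
    # wrap an angle difference into (-pi, pi]
    return (a + np.pi) % (2 * np.pi) - np.pi

def residuals(x, odom, loops):
    poses = x.reshape(-1, 3)                      # one [x, y, yaw] per keyframe
    res = list(poses[0])                          # prior factor anchoring the first pose
    for (i, j, dx, dy, dth) in odom + loops:      # between-factors (odometry and loop)
        pi, pj = poses[i], poses[j]
        c, s = np.cos(pi[2]), np.sin(pi[2])
        # relative pose of keyframe j expressed in the frame of keyframe i
        ex = c * (pj[0] - pi[0]) + s * (pj[1] - pi[1]) - dx
        ey = -s * (pj[0] - pi[0]) + c * (pj[1] - pi[1]) - dy
        res += [ex, ey, wrap(pj[2] - pi[2] - dth)]
    return np.array(res)

# odometry claims 1 m forward per step; the loop closure reports ~0.3 m of drift
odom = [(0, 1, 1.0, 0.0, 0.0), (1, 2, 1.0, 0.0, 0.0), (2, 3, 1.0, 0.0, 0.0)]
loops = [(0, 3, 2.7, 0.0, 0.0)]
x0 = np.zeros(4 * 3)
sol = least_squares(residuals, x0, args=(odom, loops), loss="huber")
print(sol.x.reshape(-1, 3))                       # poses after graph optimisation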
Loop-closure detection mainly addresses the error between the actual path and the estimated path of the mobile robot, which causes a discrepancy between the created map and the real map; even with a fully optimized algorithm, some matching error between parts of the map and the camera is unavoidable. In order to reduce the search range for matching the sub-map against the camera, thereby reducing the amount of computation, improving detection efficiency and better completing the loop-detection task, this embodiment introduces odometry data. To ensure that the current camera element matches an element point in the sub-map that created it, the relative pose of the camera must be stored in memory for loop-closure optimization. Apart from these relative poses, all other camera and sub-map pairs are considered for loop closure unless the sub-map has been changed. The scan matcher runs in the background, and if a match is found the corresponding relative pose is added to the optimization problem.
In visual SLAM, the input data can be regarded as point-cloud features. Directly differencing the feature points of two pictures is not a good way to obtain and store the difference between them: the front end would have to keep enough frames to achieve a match, consuming more global memory, and the more historical keyframes there are at the same location, the larger the number of candidate matches becomes; moreover, the feature points used directly at the front end are unclassified and hard to look up. Therefore, to solve the loop-detection problem in visual SLAM, this embodiment adopts the bag-of-words model. The structure of the bag-of-words dictionary is discussed first, followed by the implementation details. The bag-of-words model is essentially a model that classifies the individual feature points of an image. Global optimization is a major feature of loop detection, so every input keyframe uses the BOW model to output its feature vector (when historical frames are available); if no historical frame matches, the keyframe's words are stored in the dictionary, which does not affect subsequent matching performance.
In this embodiment, the constructed semantic map is compared with the simulation ground truth using the EVO tool; the trajectory error and the position error obtained with EVO in the experiment are shown in FIG. 5 and FIG. 6. In a second experiment, IMU sensor data is added to the purely visual semantic algorithm to overcome the excessive deviation caused by the visual front-end odometry. The camera is a low-frequency input at 10 Hz, while the IMU data is a high-frequency input at 100 Hz. The IMU data is pre-integrated, sliding-window-based optimization is added, and old frames are marginalized out of the sliding window to reduce the accumulated error. The initial value of the sliding window is taken from the pose measured by the IMU, which improves the initialization efficiency of the whole system. Finally, the factor graph optimizes the IMU factors and visual odometry factors generated from the IMU pre-integration to obtain the final pose and generate the trajectory graph; the trajectory error is shown in FIG. 7. After the back-end optimization algorithm and the IMU sensor are added, it can clearly be seen that there is no obvious trajectory drift between the algorithm result and the ground truth. The absolute position error results, including the mean error, minimum error, root-mean-square error (RMSE) and maximum error of the visual-semantic-inertial algorithm and the purely visual semantic algorithm, were computed with the EVO tool and are shown in Table 1. In the experiment, the RMSE of the purely visual semantic algorithm is 11.24, while the RMSE of the visual-semantic-inertial algorithm of this embodiment is 2.014.
Table 1 absolute position error experiment results table
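The absolute position error metrics reported in Table 1 can in principle be reproduced as follows: rigidly align the estimated trajectory to the ground truth (Umeyama/Kabsch alignment, as the EVO tool does) and evaluate the residual translation errors; the trajectory format and the function name ate_metrics are assumptions:
# Hedged sketch: absolute trajectory error (mean / min / max / RMSE) after rigid
# alignment of the estimated positions to the ground-truth positions.
import numpy as np

def ate_metrics(est, gt):
    """est, gt: (N, 3) arrays of corresponding estimated and ground-truth positions."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    E, G = est - mu_e, gt - mu_g
    U, _, Vt = np.linalg.svd(G.T @ E)             # rotation of the rigid (Kabsch) alignment
    S = np.diag([1, 1, np.sign(np.linalg.det(U @ Vt))])
    R = U @ S @ Vt
    t = mu_g - R @ mu_e
    d = np.linalg.norm(gt - (est @ R.T + t), axis=1)   # residual translation errors
    return {"rmse": float(np.sqrt((d ** 2).mean())),
            "mean": float(d.mean()), "min": float(d.min()), "max": float(d.max())}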
The second embodiment is as follows:
the electronic equipment comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the semantic visual SLAM map construction method when executing the computer program.
The computer device of the present invention may be a device including a processor and a memory, such as a single chip microcomputer including a central processing unit. And the processor is used for realizing the steps of the semantic visual SLAM map construction method when executing the computer program stored in the memory.
The processor may be a central processing unit (Central Processing Unit, CPU), other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like; the storage data area may store data (such as audio data, phonebook, etc.) created according to the use of the handset, etc. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, memory, plug-in hard disk, smart Media Card (SMC), secure Digital (SD) Card, flash Card (Flash Card), at least one disk storage device, flash memory device, or other volatile solid-state storage device.
And a third specific embodiment:
a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the semantic visual SLAM map construction method.
The computer readable storage medium of the present invention may be any form of storage medium that is read by a processor of a computer device, including but not limited to a nonvolatile memory, a volatile memory, a ferroelectric memory, etc., on which a computer program is stored, and when the processor of the computer device reads and executes the computer program stored in the memory, the steps of a semantic visual SLAM map construction method described above may be implemented.
The computer program comprises computer program code which may be in source code form, object code form, executable file or in some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the computer readable medium contains content that can be appropriately scaled according to the requirements of jurisdictions in which such content is subject to legislation and patent practice, such as in certain jurisdictions in which such content is subject to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.
It is noted that relational terms such as "first" and "second", and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Although the present application has been described hereinabove with reference to specific embodiments, various modifications thereof may be made and equivalents may be substituted for elements thereof without departing from the scope of the application. In particular, the features of the embodiments disclosed in this application may be combined with each other in any way as long as there is no structural conflict, and the exhaustive description of these combinations is not given in this specification merely for the sake of omitting the sake of brevity and saving resources. Therefore, it is intended that the present application not be limited to the particular embodiments disclosed, but that the present application include all embodiments falling within the scope of the appended claims.

Claims (7)

1. The semantic visual SLAM map construction method is characterized by comprising the following steps of:
s1, calibrating internal and external parameters of a camera, and then collecting sub map images;
the specific implementation method for calibrating the internal and external parameters of the camera in the step S1 comprises the following steps:
s1.1, manufacturing a black-and-white checkerboard, attaching the black-and-white checkerboard to a plane to serve as a calibration object, and shooting the black-and-white checkerboard with a camera at different angles, wherein 20 black-and-white checkerboard images are shot;
s1.2, marking corner points of a black-and-white checkerboard from the black-and-white checkerboard image shot in the step S1.1;
s1.3, solving the parameters in the camera and the parameters outside the camera corresponding to each image shot in the step S1.1;
firstly, establishing a camera imaging model as follows:
wherein u and v are the horizontal and vertical coordinates of the object projected into the image coordinate system, s is a scale factor, K is the camera intrinsic parameter matrix, r_1 and r_2 are the first and second column vectors of the rotation matrix, t is the translation vector, and X_W and Y_W are the x-axis and y-axis coordinates of the object in the real-space coordinate system;
setting H as a homography matrix, and converting a camera imaging model into:
H is a 3×3 matrix, and the calculation expression of H is:
H = A[r_1 r_2 t]
wherein A is the camera intrinsic parameter matrix;
setting R as a rotation matrix, and obtaining the following formula:
S1.4, estimating the camera intrinsic parameters and extrinsic parameters obtained in step S1.3, together with the radial distortion coefficient, by the least-squares method;
s1.5, optimizing the camera inner parameters and the camera outer parameters obtained in the step S1.4 based on a maximum likelihood method to obtain optimized camera inner parameters and optimized camera outer parameter coefficients;
s2, carrying out inverse perspective transformation on the sub-map image acquired in the step S1, and then carrying out sub-map image fusion and splicing to obtain an all-around sub-map image;
the specific implementation method of the step S2 comprises the following steps:
s2.1, establishing forward projection of intersection points of an x axis and a y axis, which are overlapped with an optical center of a camera, on the ground, establishing a dynamic inverse perspective transformation equation when a pitching angle exists in the camera, and carrying out inverse perspective transformation on the acquired sub-map image, wherein the calculation expression is as follows:
wherein x(u, v) is the horizontal road-surface coordinate of the inverse-perspective top view, y(u, v) is the vertical road-surface coordinate of the inverse-perspective top view, u and v are the horizontal and vertical coordinates of the sub-map perspective image in the image coordinate system, Δz is the height of the camera above the ground, α is the heading (yaw) angle of the camera, β is the roll angle of the camera, and θ is the pitch angle of the camera;
s2.2, carrying out sub-map image fusion and splicing on the sub-map image subjected to the inverse perspective transformation obtained in the step S2.1 to obtain an all-around sub-map image, wherein the calculation expression is as follows:
wherein the back-projection term converts a pixel into a spatial point, [R_c, t_c] are the extrinsic parameters of each camera, [u v] is the pixel position in the image coordinate system, x_v and y_v are the x-axis and y-axis coordinates of the feature point in the vehicle coordinate system, λ is a scalar, and col denotes a column of the matrix;
s3, inputting the all-around sub-map image obtained in the step S2 into a neural network for feature extraction, and then performing feature projection to obtain sub-map image coordinates under a world coordinate system;
and S4, carrying out local map building by utilizing the sub-map image coordinates under the world coordinate system obtained in the step S3, detecting loop, and carrying out global optimization to obtain the semantic visual SLAM map.
2. The semantic visual SLAM map construction method of claim 1, wherein in step S1, the number of cameras is 6, and the fields of view of the 6 cameras cover 360 degrees.
3. The semantic visual SLAM map construction method of claim 2, wherein the specific implementation method of step S3 comprises the following steps:
s3.1, inputting the circular sub-map image obtained in the step S2 into a deeplabv3+ neural network model for semantic information feature extraction, and carrying out feature projection after obtaining feature points, wherein the calculation expression is as follows:
wherein K_ipm is the intrinsic parameter matrix of the look-around sub-map image, and [u_ipm, v_ipm] are the pixel coordinates on the look-around sub-map image;
s3.2, performing feature projection after obtaining feature points, and projecting features from an image coordinate system to a 3D coordinate system, wherein the calculation expression is as follows:
converting image features from vehicle coordinates to a world coordinate system by an odometer, and calculating an expression as follows:
wherein [R_o, t_o] is the pose value of the odometer.
4. The semantic visual SLAM map construction method of claim 3, wherein the specific implementation method of step S4 comprises the following steps:
S4.1, carrying out local mapping on the feature points of the feature projection obtained in step S3.2, wherein a frame image is divided into two parts, the left block diagram being the local image and the right block diagram being the global image; the local map is constructed from a series of iterative scans and contains the scan-position information and the sub-map information, so as to obtain a local sub-map;
s4.2, performing element point matching on the local sub-map obtained in the step S4.1 by using an ICP point cloud matching algorithm, initializing by using an IMU if the frame is the first frame, then performing IMU pre-integration, and then inserting a relative posture memory of a camera into the memory to perform loop detection and global optimization, wherein the calculation expression of the loop detection is as follows:
wherein the first term denotes the pose of the local sub-map, i being any one of the m sub-maps, the second term denotes the pose of the camera, j being any one of the n camera frames, ξ_ij is the pose of the feature point in the local sub-map coordinate system, Σ_ij is the corresponding covariance matrix, E is the residual, and ρ is the optimization coefficient;
the calculation formula of the residual E is as follows:
the pose of the local sub-map and the pose of the camera are obtained at the front end, and ξ_ij is obtained by matching with the loop-closure camera;
global optimization calculates the pose R, t by PnP from the 3D coordinates of the co-visible points and the camera intrinsic matrix; the projection of the feature points onto the second frame image is calculated from R, t and the projection model of the camera; the pose is optimized through the constructed projection function, and the camera parameters and the coordinates of the three-dimensional space points are optimized by minimizing the reprojection error, i.e. the difference between the projection of the real three-dimensional space points onto the image plane and their re-projection;
then converting the image features from the vehicle coordinate system to the world coordinate system by the odometer, where the calculation expression is as follows:
wherein [R_o, t_o] is the pose value of the odometer;
and S4.3, using the loop detection and global optimization of step S4.2 as prior information of the sliding window to optimize the sliding window, then optimizing the IMU pre-integration with the factor graph to generate the IMU factors and visual odometry factors of the pose, obtaining the final pose, and generating the trajectory graph.
5. The semantic visual SLAM map building method according to claim 4, wherein the loop detection and global optimization process of step S4.2 builds a bag-of-words model, and the specific implementation method comprises the following steps:
s4.2.1 constructing a dictionary by using a K-means method, creating a dictionary on all the attributes of the feature points, setting all the nodes as K types at the root node position, creating a first layer cluster of all the feature points according to the K types, and then clustering a second layer and each subsequent layer to obtain leaf nodes, forming words, and obtaining a K-ary tree model;
S4.2.2, dividing all objects into d layers based on the k-ary tree model constructed in step S4.2.1, wherein each layer has k nodes, so that at most k^d words can be contained; when an image is input, its feature points are compared with the clusters of each layer, and finally all the points of the image are classified into words;
S4.2.3, correcting the weight of the object points in the image by the TF-IDF method: for the i-th feature point in picture A, look up the corresponding leaf-node word in the dictionary and denote it as a; the number of times this word appears in picture A is n_i, the total number of occurrences of all feature points in picture A is n_a, the number of words in the dictionary is n, and the weight of the word is η_i, where the relational expression is as follows:
η_i = TF_i × IDF_i
wherein IDF_i is the inverse document frequency index and TF_i is the frequency with which the word appears in the current frame image;
S4.2.4, when the dictionaries obtained in step S4.2.3 have words in common, the picture descriptions are expressed as vectors of the same form and the similarity of the pictures is calculated, where the calculation formula is as follows:
wherein b is the comparison picture and s(a, b) is the picture similarity.
6. An electronic device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of a semantic visual SLAM map construction method according to any of claims 1-5 when the computer program is executed.
7. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements a semantic visual SLAM mapping method according to any of claims 1-5.
CN202310773986.5A 2023-06-28 2023-06-28 Semantic visual SLAM map construction method, electronic equipment and storage medium Active CN116817887B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310773986.5A CN116817887B (en) 2023-06-28 2023-06-28 Semantic visual SLAM map construction method, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310773986.5A CN116817887B (en) 2023-06-28 2023-06-28 Semantic visual SLAM map construction method, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116817887A CN116817887A (en) 2023-09-29
CN116817887B true CN116817887B (en) 2024-03-08

Family

ID=88127132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310773986.5A Active CN116817887B (en) 2023-06-28 2023-06-28 Semantic visual SLAM map construction method, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116817887B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115414B (en) * 2023-10-23 2024-02-23 西安羚控电子科技有限公司 GPS-free unmanned aerial vehicle positioning method and device based on deep learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10282915B1 (en) * 2017-12-27 2019-05-07 Industrial Technology Research Institute Superimposition device of virtual guiding indication and reality image and the superimposition method thereof
CN111462135A (en) * 2020-03-31 2020-07-28 华东理工大学 Semantic mapping method based on visual S L AM and two-dimensional semantic segmentation
CN114964236A (en) * 2022-05-25 2022-08-30 重庆长安汽车股份有限公司 Mapping and vehicle positioning system and method for underground parking lot environment
CN115774280A (en) * 2022-11-22 2023-03-10 哈尔滨师范大学 Multi-source fusion positioning navigation method, electronic equipment and storage medium
CN116129386A (en) * 2023-03-03 2023-05-16 合众新能源汽车股份有限公司 Method, system and computer readable medium for detecting a travelable region

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10282915B1 (en) * 2017-12-27 2019-05-07 Industrial Technology Research Institute Superimposition device of virtual guiding indication and reality image and the superimposition method thereof
CN111462135A (en) * 2020-03-31 2020-07-28 华东理工大学 Semantic mapping method based on visual S L AM and two-dimensional semantic segmentation
CN114964236A (en) * 2022-05-25 2022-08-30 重庆长安汽车股份有限公司 Mapping and vehicle positioning system and method for underground parking lot environment
CN115774280A (en) * 2022-11-22 2023-03-10 哈尔滨师范大学 Multi-source fusion positioning navigation method, electronic equipment and storage medium
CN116129386A (en) * 2023-03-03 2023-05-16 合众新能源汽车股份有限公司 Method, system and computer readable medium for detecting a travelable region

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Error analysis and accuracy verification of the inverse perspective mapping formula; Cao Yu et al.; Acta Photonica Sinica (光子学报); 2011-12-31; Vol. 40, No. 12; pp. 1833-1838 *

Also Published As

Publication number Publication date
CN116817887A (en) 2023-09-29

Similar Documents

Publication Publication Date Title
CN109631855B (en) ORB-SLAM-based high-precision vehicle positioning method
CN111199564B (en) Indoor positioning method and device of intelligent mobile terminal and electronic equipment
CN111028277B (en) SAR and optical remote sensing image registration method based on pseudo-twin convolution neural network
Schonberger et al. Structure-from-motion revisited
WO2022002150A1 (en) Method and device for constructing visual point cloud map
CN104200523B (en) A kind of large scene three-dimensional rebuilding method for merging additional information
CN110322511B (en) Semantic SLAM method and system based on object and plane features
Sun et al. A dataset for benchmarking image-based localization
KR20220053513A (en) Image data automatic labeling method and device
CN104616247B (en) A kind of method for map splicing of being taken photo by plane based on super-pixel SIFT
CN116817887B (en) Semantic visual SLAM map construction method, electronic equipment and storage medium
CN110544202B (en) Parallax image splicing method and system based on template matching and feature clustering
CN105427333A (en) Real-time registration method of video sequence image, system and shooting terminal
CN111968177A (en) Mobile robot positioning method based on fixed camera vision
KR102206834B1 (en) Method and system for detecting changes in road-layout information
Gao et al. DCT-based local descriptor for robust matching and feature tracking in wide area motion imagery
CN117132737B (en) Three-dimensional building model construction method, system and equipment
KR102249381B1 (en) System for generating spatial information of mobile device using 3D image information and method therefor
CN113516682A (en) Loop detection method of laser SLAM
CN112148817B (en) SLAM optimization method, device and system based on panorama
CN117053779A (en) Tightly coupled laser SLAM method and device based on redundant key frame removal
CN116734834A (en) Positioning and mapping method and device applied to dynamic scene and intelligent equipment
CN116129386A (en) Method, system and computer readable medium for detecting a travelable region
CN110135474A (en) A kind of oblique aerial image matching method and system based on deep learning
CN114170077A (en) Unmanned aerial vehicle-based nonlinear image stitching sequence acquisition method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant