CN112365604A - AR equipment depth of field information application method based on semantic segmentation and SLAM - Google Patents
AR equipment depth of field information application method based on semantic segmentation and SLAM
- Publication number
- CN112365604A (application number CN202011224040.6A)
- Authority
- CN
- China
- Prior art keywords
- semantic segmentation
- map
- image
- slam
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T19/00—Manipulating 3D models or images for computer graphics
- G06T19/006—Mixed reality
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/55—Depth or shape recovery from multiple images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computer Graphics (AREA)
- Computer Hardware Design (AREA)
- Image Analysis (AREA)
Abstract
The application relates to a method for applying depth-of-field information on AR devices based on semantic segmentation and SLAM. The method comprises the following steps: designing a semantic segmentation model for the front camera of the AR device and segmenting each object in the scene in front of the user's eyes with the model to obtain a semantic segmentation image; using SLAM technology to plan and visually emphasize targets the user needs to notice and to obtain a depth image of the user's environment; and fusing the depth image with the semantic segmentation image. A binocular fisheye camera realizes purely visual SLAM, the map is built dynamically with a dense visual method, each object in the environment is located and segmented in depth, and virtual information is coupled with the real environment; after the depth image and the semantic segmentation image are fused, the user is guided to view and observe objects at different depths so as to adjust the user's vision.
Description
Technical Field
The application relates to the technical field of artificial intelligence image processing, in particular to an AR equipment depth of field information application method based on semantic segmentation and SLAM.
Background
With the development of information technology, daily life increasingly relies on the Internet of Things (IoT) to realize the digital life promised by smart-living concepts, such as smart home systems, personal health monitoring, or large-scale machine-to-machine communication. Augmented Reality (AR), a technology that combines virtual content with the real world, is a core technology for integrating humans into such systems and provides an interface through which people interact with the digital world of smart living. Although AR is not yet ready for large-scale deployment in fields such as medicine, industrial production, and industrial design, it is already used in other areas such as entertainment. In recent years, rapid progress in the miniaturization of electronics and the growth of computing power has made it possible to develop AR systems with capabilities relevant to consumers and industry. An AR system gives humans access to digital information through a layer of information overlaid on the physical world. According to the widely used reality-virtuality continuum, AR lies between the real environment and the virtual environment: the pose of the system and of the objects in the environment is computed accurately from the cameras and sensors, and virtual information is combined and made to interact with the real scene through image-analysis techniques.

In general, the basic components of an AR system are the visualization technology, the sensor system, the tracking system, the processing unit, and the user interface. Visualization technology renders digital information in the real environment and mainly includes four approaches: head-mounted displays, handheld devices, static screens, and projectors. The sensing system acquires information from the environment; for most systems its central input is one or more cameras, including ordinary optical cameras, infrared cameras, and depth cameras. The tracking system is the key component that allows digital objects to be placed accurately in the physical world. The user interface provides two-way communication between system and user, for example force feedback and audio prompts output by the system and inputs from the user. The processing unit runs the software of the AR system. Current AR systems can generally be divided into two types: virtual-real combination based on marker points (anchor points), and coupling by markerless methods. The former appeared earlier and is more mature, but the need for marker points makes such AR applications quite limited; the latter couples virtual and real content through sensors and a tracking-and-localization algorithm and therefore depends on hardware performance, and the trade-offs between sensor and system complexity and between algorithm accuracy and hardware performance make it difficult for such systems to achieve an ideal result.
Augmented reality is therefore expected to become a general computing platform of the future, and the sensing and tracking system in AR is its indispensable and most critical link; this component comprises the sensors of the hardware part and the simultaneous localization and mapping (SLAM) technology of the software algorithm part. SLAM incrementally builds a map around the system's own position from the environmental features continuously observed while the system moves. Beyond a simple two-dimensional planar SLAM and a three-dimensional stereo SLAM, SLAM can also reconstruct more finely the position and pose of each object in the three-dimensional space of the environment, which greatly facilitates coupling the virtual digital information in the AR system with the real environment. However, because of factors such as the structure and cost of AR devices, the sensors in such systems are mostly purely visual, i.e., the system relies on purely visual SLAM. Compared with radar-based SLAM and hybrid SLAM, which can usually build maps with higher precision thanks to the radar, purely visual SLAM offers richer semantic information from the image data it collects, but its computational cost is high and its performance still needs to be improved. There is therefore a need to further improve purely visual SLAM in AR application environments and to exploit the semantic information in the environment.
In prior-art AR systems, the algorithm models are large and consume substantial computing power, while the hardware must be lightweight and offers only limited computing power, and the models are not tailored to the real-time operation that is required. Laser SLAM, although technically mature and highly reliable, is expensive, its radar scanning range and mounting structure are constrained, and the equipment is bulky and therefore unsuitable for AR systems.
Disclosure of Invention
Based on this, and aiming at the problems that existing algorithms are large and consume substantial computing power, it is necessary to provide an AR device depth-of-field information application method based on semantic segmentation and SLAM.
In order to achieve the above object, an embodiment of the present application provides an AR device depth information application method based on semantic segmentation and SLAM, including:
designing a semantic segmentation model for a front camera of the AR equipment, and segmenting each object in the scene in front of eyes of a user through the semantic segmentation model to obtain a semantic segmentation image;
planning and prompting enhancement are carried out on a target which needs to be noticed by a user through an SLAM technology, and a depth image of the environment of the user is obtained;
and fusing the depth image and the semantic segmentation image.
Preferably, before designing a semantic segmentation model for the front camera of the AR device, segmenting each object in the scene in front of the eyes of the user through the semantic segmentation model to obtain a semantic segmentation image, the method further includes:
the feature graph of the last convolution of the convolutional neural network model is subjected to back propagation through a convolutional neural network visualization method to calculate corresponding weight, each feature graph is multiplied by the weight to obtain a feature graph with weight, the average value of the feature graphs is calculated, and up-sampling is carried out to obtain a fine annotation learned by the coarse annotation, so that training of the convolutional neural network model based on weak supervision can be carried out by using the coarse annotation.
Preferably, the coarse annotation comprises a bounding box or label and the fine annotation comprises a heat map or mask.
Preferably, the designing a semantic segmentation model for the front camera of the AR device, segmenting each object in the scene in front of the eyes of the user through the semantic segmentation model to obtain a semantic segmentation image, and planning and enhancing a target that the user needs to pay attention to through an SLAM technique, and obtaining the depth image of the environment of the user further includes:
and the user randomly unfreezes the rear-end part parameters of the semantic segmentation model according to the self environment to realize self-definition.
Preferably, the designing a semantic segmentation model for the front camera of the AR device, and segmenting each object in the scene in front of the eyes of the user through the semantic segmentation model to obtain a semantic segmentation image includes:
pre-training a semantic segmentation model by using a data set containing all categories, performing transfer learning after a convolutional neural network layer fully learns the textures of various images, and performing targeted training on data with high AR equipment correlation;
designing a semantic segmentation model by adopting atrous (dilated) convolution and an atrous spatial pyramid pooling structure, wherein the atrous convolution part uses multi-scale regions for object positioning, combines several parallel atrous convolution branches with different dilation rates, and performs image segmentation using multi-scale features;
and separating each object in the scene in front of the eyes of the user by using a fully connected conditional random field at the rear end of the semantic segmentation model.
Preferably, the SLAM technology is a visual SLAM technology based on a binocular fish-eye camera, and the ORB-SLAM 3-based system is constructed by the visual SLAM technology.
Preferably, the ORB-SLAM 3-based system includes:
the map set (atlas), which maintains an active map within a mixed map set consisting of a series of discrete maps in order to localize new key frames, the active map being continuously optimized and updated by the local mapping process;
the tracking algorithm, which processes the sensor data, computes the poses of the current frame with respect to the active map in real time, minimizes the reprojection error of the matched feature points, and screens key frames; when the system loses tracking, the mixed map set is used for relocalization: if relocalization succeeds, tracking continues, and if it fails, a new active map is re-initialized for tracking and mapping;
local mapping, which optimizes the map by adding key frames and feature points to the active map, deleting redundant frames, and applying BA optimization with visual or visual-inertial navigation;
and loop closing and map fusion, which detect common areas between the dynamic active map and the mixed map set; if the common area lies within the active map, a loop-closing process is executed and the active map is then optimized by a global BA, and if the common area does not belong to the active map, the active map and the matching map in the mixed map set are fused into one map.
Preferably, the ORB-SLAM3 based system estimates the parameters to initialize and optimize the IMU using the method in local mapping with inertial navigation.
Preferably, the BA optimization step includes:
for the matched corresponding pixel coordinates of a first image and a second image on the active map, calculating the normalized spatial point coordinates corresponding to the pixel coordinates on the first image;
and calculating the pixel coordinates re-projected onto the second image from the spatial point coordinates, and, if the re-projected pixel coordinates do not coincide exactly with the matched pixel coordinates on the second image, establishing an equation for each pair of matched pixel coordinates, combining the equations simultaneously into an overdetermined system, and solving for the optimal pose matrix or spatial point coordinates.
Preferably, the semantic segmentation model is improved on the basis of the DeepLab model.
One of the above technical solutions has the following advantages and beneficial effects:
the AR equipment depth of field information application method based on semantic segmentation and SLAM provided by the embodiments of the application effectively solves the problems of large size and large occupied computing power of the existing algorithm, meanwhile, segmented objects which are not suitable for an AR environment are abandoned, a part of segmentation categories under weak supervision are combined, the size of the algorithm is reduced, a depth segmentation image formed by combining a depth image dominated by SLAM and a segmentation image of semantic segmentation is guided to a user to watch objects at different depths through the depth segmentation image so as to adjust the vision of the user, and the experience of the user is enhanced.
Drawings
Fig. 1 is a schematic flowchart of an AR device depth of field information application method based on semantic segmentation and SLAM in an embodiment.
Detailed Description
To facilitate an understanding of the present application, the present application will now be described more fully with reference to the accompanying drawings. Preferred embodiments of the present application are shown in the drawings. This application may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
It will be understood that when an element is referred to as being "connected" to another element, it can be directly connected to the other element and be integral therewith, or intervening elements may also be present. The terms "one end," "the other end," and the like are used herein for illustrative purposes only.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
In order to solve the problem that the conventional technology cannot verify the reliability and feasibility of an artificial retina product, in one embodiment, as shown in fig. 1, a method for applying depth information of an AR device based on semantic segmentation and SLAM is provided, and includes:
s100, designing a semantic segmentation model for a front camera of the AR equipment, and segmenting each object in the scene in front of eyes of a user through the semantic segmentation model to obtain a semantic segmentation image;
s200, planning and prompting a target to be noticed by a user through an SLAM technology to obtain a depth image of the environment of the user;
and S300, fusing the depth image and the semantic segmentation image.
Semantic segmentation classifies every pixel of a picture. In short, a picture is composed of pixels, and semantic segmentation uses an artificial intelligence algorithm to read the picture and group together the pixels that belong to the same object. For example, a captured photo is taken as the input image and, after semantic segmentation, the result image is divided into differently colored blocks: the different things in the camera picture are classified automatically by the AI image-processing algorithm, for instance trees covered in yellow, all buildings covered in red, all cars covered in purple, roads in gray, and sidewalks in sky blue. Semantic segmentation can separate each object in the scene in front of the user's eyes and help the user distinguish targets.
After semantic segmentation, the targets the user needs to notice are planned and visually emphasized. A binocular fisheye camera mounted at the front of the system performs stereo matching based on the known distance between the two cameras to recover scale and a spatial model; on this basis, map construction and object localization are carried out and a depth image of the user's own environment is obtained. Finally, this is combined with the semantic segmentation result to obtain a depth-segmented image, which is coupled with the real scene to realize depth-of-field-based attention guidance.
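The patent itself gives no code; the following is a minimal sketch, assuming already-rectified images from the stereo pair and illustrative focal-length and baseline values, of how a disparity-based depth map could be computed with OpenCV and fused with the semantic segmentation mask to obtain a per-object depth.

```python
import cv2
import numpy as np

# Minimal sketch (not the patented implementation): compute a disparity-based
# depth map from a rectified stereo pair and attach a median depth to every
# semantic region. Focal length and baseline below are illustrative assumptions.

def depth_from_stereo(left_gray, right_gray, focal_px=350.0, baseline_m=0.06):
    """Estimate depth (metres) from rectified grayscale stereo images."""
    matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=96,
                                    blockSize=7)
    disparity = matcher.compute(left_gray, right_gray).astype(np.float32) / 16.0
    disparity[disparity <= 0] = np.nan          # invalid matches
    return focal_px * baseline_m / disparity    # Z = f * B / d

def fuse_depth_and_segmentation(depth_m, seg_mask):
    """Return {class_id: median depth} so nearby targets can be emphasized."""
    per_object_depth = {}
    for class_id in np.unique(seg_mask):
        region = depth_m[seg_mask == class_id]
        region = region[np.isfinite(region)]
        if region.size:
            per_object_depth[int(class_id)] = float(np.median(region))
    return per_object_depth
```

The resulting per-object depths can then drive the attention-guidance overlay, for example by highlighting the segmented object closest to the user.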
The AR equipment is not limited to glasses, and can be an intelligent terminal such as a mobile phone.
In specific implementation, before designing a semantic segmentation model for the front camera of the AR device and segmenting each object in the scene in front of the eyes of the user through the semantic segmentation model to obtain a semantic segmentation image, the method further includes:
the method comprises the steps of performing back propagation on a feature map of the last convolution of a convolution neural network model through a convolution neural network visualization method to calculate corresponding weight, multiplying each feature map by the weight to obtain a feature map with the weight, calculating an average value of the feature maps, and performing up-sampling to obtain a fine annotation learned by a coarse annotation, so that training of the convolution neural network model based on weak supervision can be performed by using the coarse annotation, wherein the coarse annotation comprises a boundary box or a label, and the fine annotation comprises a heat map or a mask.
This is a weakly supervised data enhancement method. The invention innovatively proposes to enhance data in a weakly supervised manner: generally, weakly annotated data can be segmented directly under weak supervision, whereas the invention, after segmenting the weak annotations, uses the segmentation result as the annotation for the next round of segmentation, i.e., a self-supervised mode.
The Grad-CAM convolutional-neural-network visualization method is introduced, in which the number of feature maps of the last convolution equals the number of classes to be predicted and each feature map represents a probability map for one class.
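As an illustration of the Grad-CAM-style weighting described above, the sketch below (a PyTorch assumption with a classification head, not the patented model) back-propagates a class score, weights each feature map of the last convolution by the mean of its gradients, averages the weighted maps, and up-samples the result to image size as a heat-map pseudo-label.

```python
import torch
import torch.nn.functional as F

# Illustrative Grad-CAM-style sketch: turn a coarse label (class index) into a
# heat-map pseudo-annotation usable for weakly supervised training.

def coarse_to_fine_annotation(model, image, class_idx, feat_layer):
    feats, grads = [], []
    h1 = feat_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = feat_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    score = model(image)[0, class_idx]          # image: (1, 3, H, W), output: (1, num_classes)
    model.zero_grad()
    score.backward()
    h1.remove()
    h2.remove()

    fmap, grad = feats[0], grads[0]             # each (1, C, h, w)
    weights = grad.mean(dim=(2, 3), keepdim=True)      # one weight per feature map
    cam = F.relu((weights * fmap).mean(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                        align_corners=False)
    return cam / (cam.max() + 1e-8)             # normalized heat map / mask
```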
In specific implementation, the designing a semantic segmentation model for the front camera of the AR device, segmenting each object in the scene in front of the eyes of the user through the semantic segmentation model to obtain a semantic segmentation image, and planning and enhancing a target which needs attention of the user through an SLAM technology, wherein the obtaining of the depth image of the environment of the user further includes:
and the user randomly unfreezes the rear-end part parameters of the semantic segmentation model according to the self environment to realize self-definition. And the user-defined model fine adjustment is carried out according to the self environment of the user, so that the use experience of the user is gradually improved through long-time use with low calculation power.
In specific implementation, the designing a semantic segmentation model for the front camera of the AR device, and segmenting each object in the scene in front of the eyes of the user through the semantic segmentation model to obtain a semantic segmentation image includes:
pre-training a semantic segmentation model by using a data set containing all categories, performing transfer learning after a convolutional neural network layer fully learns the textures of various images, and performing targeted training on data with high AR equipment correlation;
designing the semantic segmentation model with atrous (dilated) convolution and an atrous spatial pyramid pooling structure, wherein the atrous convolution part uses multi-scale regions for object positioning, combining several parallel atrous convolution branches with different dilation rates, and performing image segmentation using multi-scale features (a sketch of such a structure is given after this list);
and separating each object in the scene in front of the eyes of the user by using a fully connected conditional random field at the rear end of the semantic segmentation model.
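A sketch of such an atrous spatial pyramid pooling block is shown below; the dilation rates and channel sizes are illustrative assumptions rather than the patent's actual configuration.

```python
import torch
import torch.nn as nn

# Sketch of an atrous spatial pyramid pooling (ASPP) block in the spirit of
# DeepLab: parallel dilated-convolution branches with different rates are
# fused to capture multi-scale context. Rates and channel sizes are assumed.

class ASPP(nn.Module):
    def __init__(self, in_ch=512, out_ch=256, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3 if r > 1 else 1,
                          padding=r if r > 1 else 0, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True))
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

# Example: ASPP()(torch.randn(1, 512, 32, 32)) -> tensor of shape (1, 256, 32, 32)
```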
The deep learning method used is a convolutional neural network. The first popular deep learning segmentation method was patch classification, in which the central pixel is classified from a patch of surrounding pixels, pixel by pixel. Because the convolutional networks of the time all ended with fully connected layers, only this pixel-by-pixel segmentation approach could be used.
In a specific implementation, the SLAM technology is a visual SLAM technology based on a binocular fisheye camera, and a system based on ORB-SLAM3 is constructed with this visual SLAM technology. The system uses a bag-of-words model for relocalization, loop-closure detection, and map fusion, and can operate robustly as a purely visual or visual-inertial system.
Visual SLAM is low-cost and unconstrained, but it is mostly suited to outdoor road segmentation and mapping; fine results are hard to obtain in indoor environments with complex illumination changes, many obstacles, and small loop closures, and most visual SLAM systems can only achieve this with a dense visual method based on an RGB-D camera. By camera, SLAM is classified into monocular SLAM, binocular (stereo) SLAM, RGB-D SLAM, and so on. In monocular SLAM the trajectory and map differ from the real size by an unknown scale factor and real depth cannot be perceived, so initialization is required; with a known baseline between the two cameras, binocular SLAM can obtain depth through calibration, matching, and computation, but the computation is generally expensive; RGB-D SLAM, also called depth-camera SLAM, obtains depth information directly through structured light or time-of-flight (TOF) techniques. Monocular, binocular, and RGB-D SLAM use pinhole or fisheye camera models, and custom camera models can also be defined.
The method used for SLAM may be a direct method, such as a dense visual method or a semi-dense visual method.
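Because the binocular fisheye camera relies on the fisheye model mentioned above, one assumed pre-processing step is to remap each fisheye frame to a pinhole-like view before stereo matching; the sketch below uses OpenCV's fisheye module with placeholder intrinsics (reusing K as the new projection matrix is a simplification).

```python
import cv2
import numpy as np

# Sketch of an assumed pre-processing step: undistort a fisheye frame to a
# pinhole-like view. The intrinsic matrix K and distortion coefficients D are
# illustrative placeholders, not calibrated values from the patent.

K = np.array([[285.0, 0.0, 320.0],
              [0.0, 285.0, 240.0],
              [0.0, 0.0, 1.0]])
D = np.array([[0.01], [-0.02], [0.003], [-0.001]])   # fisheye k1..k4

def undistort_fisheye(image, K, D):
    h, w = image.shape[:2]
    map1, map2 = cv2.fisheye.initUndistortRectifyMap(
        K, D, np.eye(3), K, (w, h), cv2.CV_16SC2)
    return cv2.remap(image, map1, map2, interpolation=cv2.INTER_LINEAR)
```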
In specific implementation, the ORB-SLAM 3-based system includes:
the map set (atlas), which maintains an active map within a mixed map set consisting of a series of discrete maps in order to localize new key frames, the active map being continuously optimized and updated by the local mapping process;
the tracking algorithm, which processes the sensor data, computes the poses of the current frame with respect to the active map in real time, minimizes the reprojection error of the matched feature points, and screens key frames; when the system loses tracking, the mixed map set is used for relocalization: if relocalization succeeds, tracking continues, and if it fails, a new active map is re-initialized for tracking and mapping;
local mapping, which optimizes the map by adding key frames and feature points to the active map, deleting redundant frames, and applying BA optimization with visual or visual-inertial navigation;
and loop closing and map fusion, which detect common areas between the dynamic active map and the mixed map set; if the common area lies within the active map, a loop-closing process is executed and the active map is then optimized by a global BA, and if the common area does not belong to the active map, the active map and the matching map in the mixed map set are fused into one map.
In a concrete implementation, BA optimization in SLAM first calculates, from the camera model and the feature-matched pixel coordinates of images A and B, the normalized spatial point coordinates corresponding to the pixel coordinates on image A, and then calculates the pixel coordinates re-projected onto image B from those spatial point coordinates. The re-projected pixel coordinates (estimated values) and the matched pixel coordinates on image B (measured values) do not coincide exactly; the purpose of BA is to establish an equation for each matched feature point, combine the equations simultaneously into an over-determined system, and solve for the optimal pose matrix or spatial point coordinates (both can be optimized at the same time).
In particular, the ORB-SLAM 3-based system estimates the parameters for initializing and optimizing IMUs using the method in local mapping with inertial navigation.
In specific implementation, the BA optimization step includes:
for the matched corresponding pixel coordinates of a first image and a second image on the active map, calculating the normalized spatial point coordinates corresponding to the pixel coordinates on the first image;
and calculating the pixel coordinates re-projected onto the second image from the spatial point coordinates, and, if the re-projected pixel coordinates do not coincide exactly with the matched pixel coordinates on the second image, establishing an equation for each pair of matched pixel coordinates, combining the equations simultaneously into an overdetermined system, and solving for the optimal pose matrix or spatial point coordinates. A sketch of this optimization is given below.
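The following is a minimal numerical sketch of this reprojection-error minimization; the intrinsic matrix, the data arrays, and the use of scipy's least-squares solver are illustrative assumptions, since the patent only describes the over-determined system abstractly.

```python
import numpy as np
from scipy.optimize import least_squares

# Sketch of the BA step described above: given matched 3D points and their
# observed pixels in the second image, refine the pose (rotation vector +
# translation) by minimising the reprojection error. K and the data are assumed.

def project(points_3d, rvec, tvec, K):
    """Pinhole projection of Nx3 points with a Rodrigues rotation vector."""
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        R = np.eye(3)
    else:
        k = rvec / theta
        Kx = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
        R = np.eye(3) + np.sin(theta) * Kx + (1 - np.cos(theta)) * Kx @ Kx
    cam = points_3d @ R.T + tvec
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]

def reprojection_residual(params, points_3d, observed_uv, K):
    # Each matched point contributes two equations (u and v residuals),
    # so the stacked system is over-determined for a 6-DoF pose.
    return (project(points_3d, params[:3], params[3:6], K) - observed_uv).ravel()

# Usage (with real data for points_3d, observed_uv, K):
# result = least_squares(reprojection_residual, x0=np.zeros(6),
#                        args=(points_3d, observed_uv, K))
```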
In a specific implementation, the semantic segmentation model is improved on the basis of the DeepLab model: extensive training is first performed on a known open-source dataset, labels with low relevance are then pruned, and a second round of training is performed. Because the dataset undergoes weakly supervised data enhancement in a pre-processing stage, the method is applicable not only to image segmentation datasets but also to image classification datasets that contain labeled bounding boxes. During use, an online learning mechanism can be added to enhance segmentation and localization in the user's environment.
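A small sketch of the label-pruning idea, with hypothetical class names: categories of low relevance to the AR scenario are collapsed into a single obstacle label before the second round of training.

```python
import numpy as np

# Hypothetical reduced label set; names and IDs are assumptions for illustration.
KEEP = {"person": 1, "car": 2, "door": 3, "stairs": 4}
OBSTACLE_ID = 5   # every other original class is merged into "obstacle"

def remap_labels(mask, original_names):
    """Map an integer mask over the original label set onto the reduced set."""
    remapped = np.full_like(mask, OBSTACLE_ID)
    for original_id, name in enumerate(original_names):
        remapped[mask == original_id] = KEEP.get(name, OBSTACLE_ID)
    return remapped
```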
In summary, in the AR device depth-of-field information application method based on semantic segmentation and SLAM provided by the invention, parameter pruning for the AR application solves the problems that existing algorithms are large and computationally heavy; segmentation of objects unsuited to an AR environment is abandoned and some segmentation categories are merged under weak supervision, with some object labels collectively relabelled as obstacles, which reduces the size of the algorithm. Meanwhile, the invention uses a binocular fisheye camera to realize purely visual SLAM, dynamically builds the map with a dense visual method, locates and depth-segments each object in the environment, couples virtual information with the real environment, and guides the user, through the depth-segmented image, to view and observe objects at different depths so as to adjust the user's vision.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the claims. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (10)
1. An AR equipment depth of field information application method based on semantic segmentation and SLAM is characterized by comprising the following steps:
designing a semantic segmentation model for a front camera of the AR equipment, and segmenting each object in the scene in front of eyes of a user through the semantic segmentation model to obtain a semantic segmentation image;
planning and prompting enhancement are carried out on a target which needs to be noticed by a user through an SLAM technology, and a depth image of the environment of the user is obtained;
and fusing the depth image and the semantic segmentation image.
2. The method for applying the depth-of-field information of the AR device based on semantic segmentation and SLAM as claimed in claim 1, wherein the designing a semantic segmentation model for a front camera of the AR device, and before segmenting each object in the scene in front of the eyes of the user through the semantic segmentation model to obtain a semantic segmentation image, further comprises:
the feature maps of the last convolutional layer of the convolutional neural network model are back-propagated through a convolutional-neural-network visualization method to compute the corresponding weights, each feature map is multiplied by its weight to obtain weighted feature maps, the average of the weighted feature maps is computed, and up-sampling is performed to obtain a fine annotation learned from the coarse annotation, so that training of the convolutional neural network model based on weak supervision can be carried out using coarse annotations.
3. The method of claim 2, wherein the coarse annotations comprise bounding boxes or labels and the fine annotations comprise heatmaps or masks.
4. The method as claimed in claim 2, wherein the step of designing a semantic segmentation model for the front camera of the AR device, obtaining a semantic segmentation image by segmenting each object in the scene in front of the user through the semantic segmentation model, and obtaining a depth image of the user's own environment by planning and enhancing a target that the user needs to pay attention to through the SLAM technique further includes:
and the user randomly unfreezes the rear-end part parameters of the semantic segmentation model according to the self environment to realize self-definition.
5. The method of claim 4, wherein a semantic segmentation model is designed for a front camera of the AR device, and the semantic segmentation model is used for segmenting each object in the scene in front of the eyes of the user to obtain a semantic segmentation image comprises:
pre-training a semantic segmentation model by using a data set containing all categories, performing transfer learning after a convolutional neural network layer fully learns the textures of various images, and performing targeted training on data with high AR equipment correlation;
designing a semantic segmentation model by adopting atrous (dilated) convolution and an atrous spatial pyramid pooling structure, wherein the atrous convolution part uses multi-scale regions for object positioning, combines several parallel atrous convolution branches with different dilation rates, and performs image segmentation using multi-scale features;
and separating each object in the scene in front of the eyes of the user by using a fully connected conditional random field at the rear end of the semantic segmentation model.
6. The method of claim 4, wherein the SLAM technology is a visual SLAM technology based on a binocular fisheye camera, and the ORB-SLAM 3-based system is constructed by the visual SLAM technology.
7. The method of claim 6, wherein the ORB-SLAM 3-based system comprises:
the map set (atlas), which maintains an active map within a mixed map set consisting of a series of discrete maps in order to localize new key frames, the active map being continuously optimized and updated by the local mapping process;
the tracking algorithm, which processes the sensor data, computes the poses of the current frame with respect to the active map in real time, minimizes the reprojection error of the matched feature points, and screens key frames; when the system loses tracking, the mixed map set is used for relocalization: if relocalization succeeds, tracking continues, and if it fails, a new active map is re-initialized for tracking and mapping;
local mapping, which optimizes the map by adding key frames and feature points to the active map, deleting redundant frames, and applying BA optimization with visual or visual-inertial navigation;
and loop closing and map fusion, which detect common areas between the dynamic active map and the mixed map set; if the common area lies within the active map, a loop-closing process is executed and the active map is then optimized by a global BA, and if the common area does not belong to the active map, the active map and the matching map in the mixed map set are fused into one map.
8. The method of claim 7, wherein the ORB-SLAM 3-based system estimates the parameters to initialize and optimize the IMU using methods in local mapping with inertial navigation.
9. The method of claim 7, wherein the BA optimization step comprises:
for the matched corresponding pixel coordinates of a first image and a second image on the active map, calculating the normalized spatial point coordinates corresponding to the pixel coordinates on the first image;
and calculating the pixel coordinates re-projected onto the second image from the spatial point coordinates, and, if the re-projected pixel coordinates do not coincide exactly with the matched pixel coordinates on the second image, establishing an equation for each pair of matched pixel coordinates, combining the equations simultaneously into an overdetermined system, and solving for the optimal pose matrix or spatial point coordinates.
10. The method of any of claims 1 to 9, wherein the semantic segmentation model is improved based on the DeepLab model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011224040.6A CN112365604B (en) | 2020-11-05 | 2020-11-05 | AR equipment depth information application method based on semantic segmentation and SLAM |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011224040.6A CN112365604B (en) | 2020-11-05 | 2020-11-05 | AR equipment depth information application method based on semantic segmentation and SLAM |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112365604A true CN112365604A (en) | 2021-02-12 |
CN112365604B CN112365604B (en) | 2024-08-23 |
Family
ID=74508734
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011224040.6A Active CN112365604B (en) | 2020-11-05 | 2020-11-05 | AR equipment depth information application method based on semantic segmentation and SLAM |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112365604B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113409231A (en) * | 2021-06-10 | 2021-09-17 | 杭州易现先进科技有限公司 | AR portrait photographing method and system based on deep learning |
CN113409331A (en) * | 2021-06-08 | 2021-09-17 | Oppo广东移动通信有限公司 | Image processing method, image processing apparatus, terminal, and readable storage medium |
CN113537171A (en) * | 2021-09-16 | 2021-10-22 | 北京易航远智科技有限公司 | Dividing method of SLAM map |
CN113643357A (en) * | 2021-07-12 | 2021-11-12 | 杭州易现先进科技有限公司 | AR portrait photographing method and system based on 3D positioning information |
CN113781363A (en) * | 2021-09-29 | 2021-12-10 | 北京航空航天大学 | Image enhancement method with adjustable defogging effect |
CN113963000A (en) * | 2021-10-21 | 2022-01-21 | 北京字节跳动网络技术有限公司 | Image segmentation method, device, electronic equipment and program product |
CN114863165A (en) * | 2022-04-12 | 2022-08-05 | 南通大学 | Vertebral body bone density classification method based on fusion of image omics and deep learning features |
CN115294488A (en) * | 2022-10-10 | 2022-11-04 | 江西财经大学 | AR rapid object matching display method |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180053056A1 (en) * | 2016-08-22 | 2018-02-22 | Magic Leap, Inc. | Augmented reality display device with deep learning sensors |
US20190051056A1 (en) * | 2017-08-11 | 2019-02-14 | Sri International | Augmenting reality using semantic segmentation |
CN109583457A (en) * | 2018-12-03 | 2019-04-05 | 荆门博谦信息科技有限公司 | A kind of method and robot of robot localization and map structuring |
CN110827305A (en) * | 2019-10-30 | 2020-02-21 | 中山大学 | Semantic segmentation and visual SLAM tight coupling method oriented to dynamic environment |
Non-Patent Citations (1)
Title |
---|
李宾皑; 李颖; 郝鸣阳; 顾书玉: "弱监督学习语义分割方法综述" [A Survey of Weakly Supervised Semantic Segmentation Methods], 数字通信世界 [Digital Communication World], no. 07, 1 July 2020 (2020-07-01) *
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113409331A (en) * | 2021-06-08 | 2021-09-17 | Oppo广东移动通信有限公司 | Image processing method, image processing apparatus, terminal, and readable storage medium |
CN113409331B (en) * | 2021-06-08 | 2024-04-12 | Oppo广东移动通信有限公司 | Image processing method, image processing device, terminal and readable storage medium |
CN113409231A (en) * | 2021-06-10 | 2021-09-17 | 杭州易现先进科技有限公司 | AR portrait photographing method and system based on deep learning |
CN113643357A (en) * | 2021-07-12 | 2021-11-12 | 杭州易现先进科技有限公司 | AR portrait photographing method and system based on 3D positioning information |
CN113537171A (en) * | 2021-09-16 | 2021-10-22 | 北京易航远智科技有限公司 | Dividing method of SLAM map |
CN113781363B (en) * | 2021-09-29 | 2024-03-05 | 北京航空航天大学 | Image enhancement method with adjustable defogging effect |
CN113781363A (en) * | 2021-09-29 | 2021-12-10 | 北京航空航天大学 | Image enhancement method with adjustable defogging effect |
CN113963000A (en) * | 2021-10-21 | 2022-01-21 | 北京字节跳动网络技术有限公司 | Image segmentation method, device, electronic equipment and program product |
CN113963000B (en) * | 2021-10-21 | 2024-03-15 | 抖音视界有限公司 | Image segmentation method, device, electronic equipment and program product |
CN114863165A (en) * | 2022-04-12 | 2022-08-05 | 南通大学 | Vertebral body bone density classification method based on fusion of image omics and deep learning features |
CN114863165B (en) * | 2022-04-12 | 2023-06-16 | 南通大学 | Vertebral bone density classification method based on fusion of image histology and deep learning features |
CN115294488B (en) * | 2022-10-10 | 2023-01-24 | 江西财经大学 | AR rapid object matching display method |
CN115294488A (en) * | 2022-10-10 | 2022-11-04 | 江西财经大学 | AR rapid object matching display method |
Also Published As
Publication number | Publication date |
---|---|
CN112365604B (en) | 2024-08-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112365604B (en) | AR equipment depth information application method based on semantic segmentation and SLAM | |
Sahu et al. | Artificial intelligence (AI) in augmented reality (AR)-assisted manufacturing applications: a review | |
Moreau et al. | Lens: Localization enhanced by nerf synthesis | |
US11238606B2 (en) | Method and system for performing simultaneous localization and mapping using convolutional image transformation | |
JP7151016B2 (en) | A Deep Machine Learning System for Cuboid Detection | |
CN112771539B (en) | Employing three-dimensional data predicted from two-dimensional images using neural networks for 3D modeling applications | |
US11094137B2 (en) | Employing three-dimensional (3D) data predicted from two-dimensional (2D) images using neural networks for 3D modeling applications and other applications | |
US20180012411A1 (en) | Augmented Reality Methods and Devices | |
WO2022165809A1 (en) | Method and apparatus for training deep learning model | |
Kumar et al. | Monocular fisheye camera depth estimation using sparse lidar supervision | |
Won et al. | Sweepnet: Wide-baseline omnidirectional depth estimation | |
CN108230240A (en) | It is a kind of that the method for position and posture in image city scope is obtained based on deep learning | |
CN113674416B (en) | Three-dimensional map construction method and device, electronic equipment and storage medium | |
CN114972617B (en) | Scene illumination and reflection modeling method based on conductive rendering | |
KR20220081261A (en) | Method and apparatus for object pose estimation | |
US11948310B2 (en) | Systems and methods for jointly training a machine-learning-based monocular optical flow, depth, and scene flow estimator | |
CN113256699B (en) | Image processing method, image processing device, computer equipment and storage medium | |
CN116194951A (en) | Method and apparatus for stereoscopic based 3D object detection and segmentation | |
Jia et al. | Depth measurement based on a convolutional neural network and structured light | |
CN116883961A (en) | Target perception method and device | |
Bai et al. | Cyber mobility mirror for enabling cooperative driving automation: A co-simulation platform | |
US20220180548A1 (en) | Method and apparatus with object pose estimation | |
US12002227B1 (en) | Deep partial point cloud registration of objects | |
Wang et al. | Research on 3D Sampling and Monitoring of Power Supplies Based on Augmented Reality (AR) Technology | |
Liang et al. | Semantic map construction based on LIDAR and vision fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |