CN112365604A - AR equipment depth of field information application method based on semantic segmentation and SLAM - Google Patents
AR equipment depth of field information application method based on semantic segmentation and SLAM
- Publication number
- CN112365604A (application number CN202011224040.6A)
- Authority
- CN
- China
- Prior art keywords
- semantic segmentation
- map
- image
- slam
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T19/00—Manipulating 3D models or images for computer graphics
- G06T19/006—Mixed reality
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/55—Depth or shape recovery from multiple images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computer Graphics (AREA)
- Computer Hardware Design (AREA)
- Image Analysis (AREA)
Abstract
The application relates to a method for applying depth-of-field information on AR devices based on semantic segmentation and SLAM. The method comprises the following steps: designing a semantic segmentation model for the front camera of the AR device and segmenting each object in the scene in front of the user's eyes with the model to obtain a semantic segmentation image; using SLAM technology to plan and visually emphasize targets the user needs to notice and to obtain a depth image of the user's environment; and fusing the depth image with the semantic segmentation image. A binocular fisheye camera realizes purely visual SLAM, the map is built dynamically with a dense visual method, each object in the environment is located and segmented in depth, and virtual information is coupled with the real environment; after the depth image and the semantic segmentation image are fused, the user is guided to view and observe objects at different depths so as to adjust the user's vision.
Description
Technical Field
The application relates to the technical field of artificial intelligence image processing, in particular to an AR equipment depth of field information application method based on semantic segmentation and SLAM.
Background
With the development of information technology, daily life increasingly relies on the Internet of Things (IoT) to realize the digital life promised by smart-living concepts, such as smart home systems, personal health monitoring, or large-scale machine-to-machine communication. Augmented Reality (AR), a technology that combines virtual content with the real world, is a core technology for integrating humans into such systems and provides an interface through which people interact with the digital world of smart living. Although AR is not yet ready for large-scale deployment in fields such as medicine, industrial production, and industrial design, it is already used in other areas such as entertainment. In recent years, rapid progress in the miniaturization of electronics and the growth of computing power has made it possible to develop AR systems with capabilities relevant to consumers and industry. An AR system gives humans access to digital information through a layer of information overlaid on the physical world. According to the widely used reality-virtuality continuum, AR lies between the real environment and the virtual environment: the pose of the system and of the objects in the environment is computed accurately from the cameras and sensors, and virtual information is combined and made to interact with the real scene through image-analysis techniques.

In general, the basic components of an AR system are the visualization technology, the sensor system, the tracking system, the processing unit, and the user interface. Visualization technology renders digital information in the real environment and mainly includes four approaches: head-mounted displays, handheld devices, static screens, and projectors. The sensing system acquires information from the environment; for most systems its central input is one or more cameras, including ordinary optical cameras, infrared cameras, and depth cameras. The tracking system is the key component that allows digital objects to be placed accurately in the physical world. The user interface provides two-way communication between system and user, for example force feedback and audio prompts output by the system and inputs from the user. The processing unit runs the software of the AR system. Current AR systems can generally be divided into two types: virtual-real combination based on marker points (anchor points), and coupling by markerless methods. The former appeared earlier and is more mature, but the need for marker points makes such AR applications quite limited; the latter couples virtual and real content through sensors and a tracking-and-localization algorithm and therefore depends on hardware performance, and the trade-offs between sensor and system complexity and between algorithm accuracy and hardware performance make it difficult for such systems to achieve an ideal result.
Augmented reality is therefore expected to become a general computing platform of the future, and the sensing and tracking system in AR is its indispensable and most critical link; this component comprises the sensors of the hardware part and the simultaneous localization and mapping (SLAM) technology of the software algorithm part. SLAM incrementally builds a map around the system's own position from the environmental features continuously observed while the system moves. Beyond a simple two-dimensional planar SLAM and a three-dimensional stereo SLAM, SLAM can also reconstruct more finely the position and pose of each object in the three-dimensional space of the environment, which greatly facilitates coupling the virtual digital information in the AR system with the real environment. However, because of factors such as the structure and cost of AR devices, the sensors in such systems are mostly purely visual, i.e., the system relies on purely visual SLAM. Compared with radar-based SLAM and hybrid SLAM, which can usually build maps with higher precision thanks to the radar, purely visual SLAM offers richer semantic information from the image data it collects, but its computational cost is high and its performance still needs to be improved. There is therefore a need to further improve purely visual SLAM in AR application environments and to exploit the semantic information in the environment.
In prior-art AR systems, the algorithm models are large and consume substantial computing power, while the hardware must be lightweight and offers only limited computing power, and the models are not tailored to the real-time operation that is required. Laser SLAM, although technically mature and highly reliable, is expensive, its radar scanning range and mounting structure are constrained, and the equipment is bulky and therefore unsuitable for AR systems.
Disclosure of Invention
Based on this, and aiming at the problems that existing algorithms are large and consume substantial computing power, it is necessary to provide an AR device depth-of-field information application method based on semantic segmentation and SLAM.
In order to achieve the above object, an embodiment of the present application provides an AR device depth information application method based on semantic segmentation and SLAM, including:
designing a semantic segmentation model for a front camera of the AR equipment, and segmenting each object in the scene in front of eyes of a user through the semantic segmentation model to obtain a semantic segmentation image;
planning and prompting enhancement are carried out on a target which needs to be noticed by a user through an SLAM technology, and a depth image of the environment of the user is obtained;
and fusing the depth image and the semantic segmentation image.
Preferably, before designing a semantic segmentation model for the front camera of the AR device, segmenting each object in the scene in front of the eyes of the user through the semantic segmentation model to obtain a semantic segmentation image, the method further includes:
the feature graph of the last convolution of the convolutional neural network model is subjected to back propagation through a convolutional neural network visualization method to calculate corresponding weight, each feature graph is multiplied by the weight to obtain a feature graph with weight, the average value of the feature graphs is calculated, and up-sampling is carried out to obtain a fine annotation learned by the coarse annotation, so that training of the convolutional neural network model based on weak supervision can be carried out by using the coarse annotation.
Preferably, the coarse annotation comprises a bounding box or label and the fine annotation comprises a heat map or mask.
Preferably, the designing a semantic segmentation model for the front camera of the AR device, segmenting each object in the scene in front of the eyes of the user through the semantic segmentation model to obtain a semantic segmentation image, and planning and enhancing a target that the user needs to pay attention to through an SLAM technique, and obtaining the depth image of the environment of the user further includes:
and the user randomly unfreezes the rear-end part parameters of the semantic segmentation model according to the self environment to realize self-definition.
Preferably, the designing a semantic segmentation model for the front camera of the AR device, and segmenting each object in the scene in front of the eyes of the user through the semantic segmentation model to obtain a semantic segmentation image includes:
pre-training a semantic segmentation model by using a data set containing all categories, performing transfer learning after a convolutional neural network layer fully learns the textures of various images, and performing targeted training on data with high AR equipment correlation;
designing a semantic segmentation model by adopting atrous (dilated) convolution and an atrous spatial pyramid pooling structure, wherein the atrous convolution part uses multi-scale regions for object positioning, combines several parallel atrous convolution branches with different dilation rates, and performs image segmentation using multi-scale features;
and separating each object in the scene in front of the eyes of the user by using a fully connected conditional random field at the rear end of the semantic segmentation model.
Preferably, the SLAM technology is a visual SLAM technology based on a binocular fish-eye camera, and the ORB-SLAM 3-based system is constructed by the visual SLAM technology.
Preferably, the ORB-SLAM 3-based system includes:
the map set (atlas), which maintains an active map within a mixed map set consisting of a series of discrete maps in order to localize new key frames, the active map being continuously optimized and updated by the local mapping process;
the tracking algorithm, which processes the sensor data, computes the poses of the current frame with respect to the active map in real time, minimizes the reprojection error of the matched feature points, and screens key frames; when the system loses tracking, the mixed map set is used for relocalization: if relocalization succeeds, tracking continues, and if it fails, a new active map is re-initialized for tracking and mapping;
local mapping, which optimizes the map by adding key frames and feature points to the active map, deleting redundant frames, and applying BA optimization with visual or visual-inertial navigation;
and loop closing and map fusion, which detect common areas between the dynamic active map and the mixed map set; if the common area lies within the active map, a loop-closing process is executed and the active map is then optimized by a global BA, and if the common area does not belong to the active map, the active map and the matching map in the mixed map set are fused into one map.
Preferably, the ORB-SLAM3 based system estimates the parameters to initialize and optimize the IMU using the method in local mapping with inertial navigation.
Preferably, the BA optimization step includes:
for the matched corresponding pixel coordinates of a first image and a second image on the active map, calculating the normalized spatial point coordinates corresponding to the pixel coordinates on the first image;
and calculating the pixel coordinates re-projected onto the second image from the spatial point coordinates, and, if the re-projected pixel coordinates do not coincide exactly with the matched pixel coordinates on the second image, establishing an equation for each pair of matched pixel coordinates, combining the equations simultaneously into an overdetermined system, and solving for the optimal pose matrix or spatial point coordinates.
Preferably, the semantic segmentation model is improved on the basis of the DeepLab model.
One of the above technical solutions has the following advantages and beneficial effects:
the AR equipment depth of field information application method based on semantic segmentation and SLAM provided by the embodiments of the application effectively solves the problems of large size and large occupied computing power of the existing algorithm, meanwhile, segmented objects which are not suitable for an AR environment are abandoned, a part of segmentation categories under weak supervision are combined, the size of the algorithm is reduced, a depth segmentation image formed by combining a depth image dominated by SLAM and a segmentation image of semantic segmentation is guided to a user to watch objects at different depths through the depth segmentation image so as to adjust the vision of the user, and the experience of the user is enhanced.
Drawings
Fig. 1 is a schematic flowchart of an AR device depth of field information application method based on semantic segmentation and SLAM in an embodiment.
Detailed Description
To facilitate an understanding of the present application, the present application will now be described more fully with reference to the accompanying drawings. Preferred embodiments of the present application are shown in the drawings. This application may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
It will be understood that when an element is referred to as being "connected" to another element, it can be directly connected to the other element and be integral therewith, or intervening elements may also be present. The terms "one end," "the other end," and the like are used herein for illustrative purposes only.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
In order to solve the problem that the conventional technology cannot verify the reliability and feasibility of an artificial retina product, in one embodiment, as shown in fig. 1, a method for applying depth information of an AR device based on semantic segmentation and SLAM is provided, and includes:
s100, designing a semantic segmentation model for a front camera of the AR equipment, and segmenting each object in the scene in front of eyes of a user through the semantic segmentation model to obtain a semantic segmentation image;
s200, planning and prompting a target to be noticed by a user through an SLAM technology to obtain a depth image of the environment of the user;
and S300, fusing the depth image and the semantic segmentation image.
Semantic segmentation classifies every pixel of a picture. In short, a picture is composed of pixels, and semantic segmentation uses an artificial intelligence algorithm to read the picture and group together the pixels that belong to the same object. For example, a captured photo is taken as the input image and, after semantic segmentation, the result image is divided into differently colored blocks: the different things in the camera picture are classified automatically by the AI image-processing algorithm, for instance trees covered in yellow, all buildings covered in red, all cars covered in purple, roads in gray, and sidewalks in sky blue. Semantic segmentation can separate each object in the scene in front of the user's eyes and help the user distinguish targets.
After semantic segmentation, the targets the user needs to notice are planned and visually emphasized. A binocular fisheye camera mounted at the front of the system performs stereo matching based on the known distance between the two cameras to recover scale and a spatial model; on this basis, map construction and object localization are carried out and a depth image of the user's own environment is obtained. Finally, this is combined with the semantic segmentation result to obtain a depth-segmented image, which is coupled with the real scene to realize depth-of-field-based attention guidance.
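The patent itself gives no code; the following is a minimal sketch, assuming already-rectified images from the stereo pair and illustrative focal-length and baseline values, of how a disparity-based depth map could be computed with OpenCV and fused with the semantic segmentation mask to obtain a per-object depth.

```python
import cv2
import numpy as np

# Minimal sketch (not the patented implementation): compute a disparity-based
# depth map from a rectified stereo pair and attach a median depth to every
# semantic region. Focal length and baseline below are illustrative assumptions.

def depth_from_stereo(left_gray, right_gray, focal_px=350.0, baseline_m=0.06):
    """Estimate depth (metres) from rectified grayscale stereo images."""
    matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=96,
                                    blockSize=7)
    disparity = matcher.compute(left_gray, right_gray).astype(np.float32) / 16.0
    disparity[disparity <= 0] = np.nan          # invalid matches
    return focal_px * baseline_m / disparity    # Z = f * B / d

def fuse_depth_and_segmentation(depth_m, seg_mask):
    """Return {class_id: median depth} so nearby targets can be emphasized."""
    per_object_depth = {}
    for class_id in np.unique(seg_mask):
        region = depth_m[seg_mask == class_id]
        region = region[np.isfinite(region)]
        if region.size:
            per_object_depth[int(class_id)] = float(np.median(region))
    return per_object_depth
```

The resulting per-object depths can then drive the attention-guidance overlay, for example by highlighting the segmented object closest to the user.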
The AR equipment is not limited to glasses, and can be an intelligent terminal such as a mobile phone.
In specific implementation, before designing a semantic segmentation model for the front camera of the AR device and segmenting each object in the scene in front of the eyes of the user through the semantic segmentation model to obtain a semantic segmentation image, the method further includes:
the method comprises the steps of performing back propagation on a feature map of the last convolution of a convolution neural network model through a convolution neural network visualization method to calculate corresponding weight, multiplying each feature map by the weight to obtain a feature map with the weight, calculating an average value of the feature maps, and performing up-sampling to obtain a fine annotation learned by a coarse annotation, so that training of the convolution neural network model based on weak supervision can be performed by using the coarse annotation, wherein the coarse annotation comprises a boundary box or a label, and the fine annotation comprises a heat map or a mask.
This is a weakly supervised data enhancement method. The invention innovatively proposes to enhance data in a weakly supervised manner: generally, weakly annotated data can be segmented directly under weak supervision, whereas the invention, after segmenting the weak annotations, uses the segmentation result as the annotation for the next round of segmentation, i.e., a self-supervised mode.
The Grad-CAM convolutional-neural-network visualization method is introduced, in which the number of feature maps of the last convolution equals the number of classes to be predicted and each feature map represents a probability map for one class.
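As an illustration of the Grad-CAM-style weighting described above, the sketch below (a PyTorch assumption with a classification head, not the patented model) back-propagates a class score, weights each feature map of the last convolution by the mean of its gradients, averages the weighted maps, and up-samples the result to image size as a heat-map pseudo-label.

```python
import torch
import torch.nn.functional as F

# Illustrative Grad-CAM-style sketch: turn a coarse label (class index) into a
# heat-map pseudo-annotation usable for weakly supervised training.

def coarse_to_fine_annotation(model, image, class_idx, feat_layer):
    feats, grads = [], []
    h1 = feat_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = feat_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    score = model(image)[0, class_idx]          # image: (1, 3, H, W), output: (1, num_classes)
    model.zero_grad()
    score.backward()
    h1.remove()
    h2.remove()

    fmap, grad = feats[0], grads[0]             # each (1, C, h, w)
    weights = grad.mean(dim=(2, 3), keepdim=True)      # one weight per feature map
    cam = F.relu((weights * fmap).mean(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                        align_corners=False)
    return cam / (cam.max() + 1e-8)             # normalized heat map / mask
```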
In specific implementation, the designing a semantic segmentation model for the front camera of the AR device, segmenting each object in the scene in front of the eyes of the user through the semantic segmentation model to obtain a semantic segmentation image, and planning and enhancing a target which needs attention of the user through an SLAM technology, wherein the obtaining of the depth image of the environment of the user further includes:
and the user randomly unfreezes the rear-end part parameters of the semantic segmentation model according to the self environment to realize self-definition. And the user-defined model fine adjustment is carried out according to the self environment of the user, so that the use experience of the user is gradually improved through long-time use with low calculation power.
In specific implementation, the designing a semantic segmentation model for the front camera of the AR device, and segmenting each object in the scene in front of the eyes of the user through the semantic segmentation model to obtain a semantic segmentation image includes:
pre-training a semantic segmentation model by using a data set containing all categories, performing transfer learning after a convolutional neural network layer fully learns the textures of various images, and performing targeted training on data with high AR equipment correlation;
designing the semantic segmentation model with atrous (dilated) convolution and an atrous spatial pyramid pooling structure, wherein the atrous convolution part uses multi-scale regions for object positioning, combining several parallel atrous convolution branches with different dilation rates, and performing image segmentation using multi-scale features (a sketch of such a structure is given after this list);
and separating each object in the scene in front of the eyes of the user by using a fully connected conditional random field at the rear end of the semantic segmentation model.
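A sketch of such an atrous spatial pyramid pooling block is shown below; the dilation rates and channel sizes are illustrative assumptions rather than the patent's actual configuration.

```python
import torch
import torch.nn as nn

# Sketch of an atrous spatial pyramid pooling (ASPP) block in the spirit of
# DeepLab: parallel dilated-convolution branches with different rates are
# fused to capture multi-scale context. Rates and channel sizes are assumed.

class ASPP(nn.Module):
    def __init__(self, in_ch=512, out_ch=256, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3 if r > 1 else 1,
                          padding=r if r > 1 else 0, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True))
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

# Example: ASPP()(torch.randn(1, 512, 32, 32)) -> tensor of shape (1, 256, 32, 32)
```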
The deep learning method used is a convolutional neural network. The first popular deep learning segmentation method was patch classification, in which the central pixel is classified from a patch of surrounding pixels, pixel by pixel. Because the convolutional networks of the time all ended with fully connected layers, only this pixel-by-pixel segmentation approach could be used.
In a specific implementation, the SLAM technology is a visual SLAM technology based on a binocular fisheye camera, and a system based on ORB-SLAM3 is constructed with this visual SLAM technology. The system uses a bag-of-words model for relocalization, loop-closure detection, and map fusion, and can operate robustly as a purely visual or visual-inertial system.
Visual SLAM is low-cost and unconstrained, but it is mostly suited to outdoor road segmentation and mapping; fine results are hard to obtain in indoor environments with complex illumination changes, many obstacles, and small loop closures, and most visual SLAM systems can only achieve this with a dense visual method based on an RGB-D camera. By camera, SLAM is classified into monocular SLAM, binocular (stereo) SLAM, RGB-D SLAM, and so on. In monocular SLAM the trajectory and map differ from the real size by an unknown scale factor and real depth cannot be perceived, so initialization is required; with a known baseline between the two cameras, binocular SLAM can obtain depth through calibration, matching, and computation, but the computation is generally expensive; RGB-D SLAM, also called depth-camera SLAM, obtains depth information directly through structured light or time-of-flight (TOF) techniques. Monocular, binocular, and RGB-D SLAM use pinhole or fisheye camera models, and custom camera models can also be defined.
The method used for SLAM may be a direct method, such as a dense visual method or a semi-dense visual method.
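Because the binocular fisheye camera relies on the fisheye model mentioned above, one assumed pre-processing step is to remap each fisheye frame to a pinhole-like view before stereo matching; the sketch below uses OpenCV's fisheye module with placeholder intrinsics (reusing K as the new projection matrix is a simplification).

```python
import cv2
import numpy as np

# Sketch of an assumed pre-processing step: undistort a fisheye frame to a
# pinhole-like view. The intrinsic matrix K and distortion coefficients D are
# illustrative placeholders, not calibrated values from the patent.

K = np.array([[285.0, 0.0, 320.0],
              [0.0, 285.0, 240.0],
              [0.0, 0.0, 1.0]])
D = np.array([[0.01], [-0.02], [0.003], [-0.001]])   # fisheye k1..k4

def undistort_fisheye(image, K, D):
    h, w = image.shape[:2]
    map1, map2 = cv2.fisheye.initUndistortRectifyMap(
        K, D, np.eye(3), K, (w, h), cv2.CV_16SC2)
    return cv2.remap(image, map1, map2, interpolation=cv2.INTER_LINEAR)
```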
In specific implementation, the ORB-SLAM 3-based system includes:
the map set (atlas), which maintains an active map within a mixed map set consisting of a series of discrete maps in order to localize new key frames, the active map being continuously optimized and updated by the local mapping process;
the tracking algorithm, which processes the sensor data, computes the poses of the current frame with respect to the active map in real time, minimizes the reprojection error of the matched feature points, and screens key frames; when the system loses tracking, the mixed map set is used for relocalization: if relocalization succeeds, tracking continues, and if it fails, a new active map is re-initialized for tracking and mapping;
local mapping, which optimizes the map by adding key frames and feature points to the active map, deleting redundant frames, and applying BA optimization with visual or visual-inertial navigation;
and loop closing and map fusion, which detect common areas between the dynamic active map and the mixed map set; if the common area lies within the active map, a loop-closing process is executed and the active map is then optimized by a global BA, and if the common area does not belong to the active map, the active map and the matching map in the mixed map set are fused into one map.
In a concrete implementation, BA optimization in SLAM first calculates, from the camera model and the feature-matched pixel coordinates of images A and B, the normalized spatial point coordinates corresponding to the pixel coordinates on image A, and then calculates the pixel coordinates re-projected onto image B from those spatial point coordinates. The re-projected pixel coordinates (estimated values) and the matched pixel coordinates on image B (measured values) do not coincide exactly; the purpose of BA is to establish an equation for each matched feature point, combine the equations simultaneously into an over-determined system, and solve for the optimal pose matrix or spatial point coordinates (both can be optimized at the same time).
In particular, the ORB-SLAM 3-based system estimates the parameters for initializing and optimizing IMUs using the method in local mapping with inertial navigation.
In specific implementation, the BA optimization step includes:
for the matched corresponding pixel coordinates of a first image and a second image on the active map, calculating the normalized spatial point coordinates corresponding to the pixel coordinates on the first image;
and calculating the pixel coordinates re-projected onto the second image from the spatial point coordinates, and, if the re-projected pixel coordinates do not coincide exactly with the matched pixel coordinates on the second image, establishing an equation for each pair of matched pixel coordinates, combining the equations simultaneously into an overdetermined system, and solving for the optimal pose matrix or spatial point coordinates. A sketch of this optimization is given below.
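The following is a minimal numerical sketch of this reprojection-error minimization; the intrinsic matrix, the data arrays, and the use of scipy's least-squares solver are illustrative assumptions, since the patent only describes the over-determined system abstractly.

```python
import numpy as np
from scipy.optimize import least_squares

# Sketch of the BA step described above: given matched 3D points and their
# observed pixels in the second image, refine the pose (rotation vector +
# translation) by minimising the reprojection error. K and the data are assumed.

def project(points_3d, rvec, tvec, K):
    """Pinhole projection of Nx3 points with a Rodrigues rotation vector."""
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        R = np.eye(3)
    else:
        k = rvec / theta
        Kx = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
        R = np.eye(3) + np.sin(theta) * Kx + (1 - np.cos(theta)) * Kx @ Kx
    cam = points_3d @ R.T + tvec
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]

def reprojection_residual(params, points_3d, observed_uv, K):
    # Each matched point contributes two equations (u and v residuals),
    # so the stacked system is over-determined for a 6-DoF pose.
    return (project(points_3d, params[:3], params[3:6], K) - observed_uv).ravel()

# Usage (with real data for points_3d, observed_uv, K):
# result = least_squares(reprojection_residual, x0=np.zeros(6),
#                        args=(points_3d, observed_uv, K))
```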
In a specific implementation, the semantic segmentation model is improved on the basis of the DeepLab model: extensive training is first performed on a known open-source dataset, labels with low relevance are then pruned, and a second round of training is performed. Because the dataset undergoes weakly supervised data enhancement in a pre-processing stage, the method is applicable not only to image segmentation datasets but also to image classification datasets that contain labeled bounding boxes. During use, an online learning mechanism can be added to enhance segmentation and localization in the user's environment.
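A small sketch of the label-pruning idea, with hypothetical class names: categories of low relevance to the AR scenario are collapsed into a single obstacle label before the second round of training.

```python
import numpy as np

# Hypothetical reduced label set; names and IDs are assumptions for illustration.
KEEP = {"person": 1, "car": 2, "door": 3, "stairs": 4}
OBSTACLE_ID = 5   # every other original class is merged into "obstacle"

def remap_labels(mask, original_names):
    """Map an integer mask over the original label set onto the reduced set."""
    remapped = np.full_like(mask, OBSTACLE_ID)
    for original_id, name in enumerate(original_names):
        remapped[mask == original_id] = KEEP.get(name, OBSTACLE_ID)
    return remapped
```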
In summary, in the AR device depth-of-field information application method based on semantic segmentation and SLAM provided by the invention, parameter pruning for the AR application solves the problems that existing algorithms are large and computationally heavy; segmentation of objects unsuited to an AR environment is abandoned and some segmentation categories are merged under weak supervision, with some object labels collectively relabelled as obstacles, which reduces the size of the algorithm. Meanwhile, the invention uses a binocular fisheye camera to realize purely visual SLAM, dynamically builds the map with a dense visual method, locates and depth-segments each object in the environment, couples virtual information with the real environment, and guides the user, through the depth-segmented image, to view and observe objects at different depths so as to adjust the user's vision.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the claims. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (10)
1. An AR equipment depth of field information application method based on semantic segmentation and SLAM is characterized by comprising the following steps:
designing a semantic segmentation model for a front camera of the AR equipment, and segmenting each object in the scene in front of eyes of a user through the semantic segmentation model to obtain a semantic segmentation image;
planning and prompting enhancement are carried out on a target which needs to be noticed by a user through an SLAM technology, and a depth image of the environment of the user is obtained;
and fusing the depth image and the semantic segmentation image.
2. The method for applying the depth-of-field information of the AR device based on semantic segmentation and SLAM as claimed in claim 1, wherein the designing a semantic segmentation model for a front camera of the AR device, and before segmenting each object in the scene in front of the eyes of the user through the semantic segmentation model to obtain a semantic segmentation image, further comprises:
the feature maps of the last convolutional layer of the convolutional neural network model are back-propagated through a convolutional-neural-network visualization method to compute the corresponding weights, each feature map is multiplied by its weight to obtain weighted feature maps, the average of the weighted feature maps is computed, and up-sampling is performed to obtain a fine annotation learned from the coarse annotation, so that training of the convolutional neural network model based on weak supervision can be carried out using coarse annotations.
3. The method of claim 2, wherein the coarse annotations comprise bounding boxes or labels and the fine annotations comprise heatmaps or masks.
4. The method as claimed in claim 2, wherein the step of designing a semantic segmentation model for the front camera of the AR device, obtaining a semantic segmentation image by segmenting each object in the scene in front of the user through the semantic segmentation model, and obtaining a depth image of the user's own environment by planning and enhancing a target that the user needs to pay attention to through the SLAM technique further includes:
and the user randomly unfreezes the rear-end part parameters of the semantic segmentation model according to the self environment to realize self-definition.
5. The method of claim 4, wherein a semantic segmentation model is designed for a front camera of the AR device, and the semantic segmentation model is used for segmenting each object in the scene in front of the eyes of the user to obtain a semantic segmentation image comprises:
pre-training a semantic segmentation model by using a data set containing all categories, performing transfer learning after a convolutional neural network layer fully learns the textures of various images, and performing targeted training on data with high AR equipment correlation;
designing a semantic segmentation model by adopting atrous (dilated) convolution and an atrous spatial pyramid pooling structure, wherein the atrous convolution part uses multi-scale regions for object positioning, combines several parallel atrous convolution branches with different dilation rates, and performs image segmentation using multi-scale features;
and separating each object in the scene in front of the eyes of the user by using a fully connected conditional random field at the rear end of the semantic segmentation model.
6. The method of claim 4, wherein the SLAM technology is a visual SLAM technology based on a binocular fisheye camera, and the ORB-SLAM 3-based system is constructed by the visual SLAM technology.
7. The method of claim 6, wherein the ORB-SLAM 3-based system comprises:
the map set (atlas), which maintains an active map within a mixed map set consisting of a series of discrete maps in order to localize new key frames, the active map being continuously optimized and updated by the local mapping process;
the tracking algorithm, which processes the sensor data, computes the poses of the current frame with respect to the active map in real time, minimizes the reprojection error of the matched feature points, and screens key frames; when the system loses tracking, the mixed map set is used for relocalization: if relocalization succeeds, tracking continues, and if it fails, a new active map is re-initialized for tracking and mapping;
local mapping, which optimizes the map by adding key frames and feature points to the active map, deleting redundant frames, and applying BA optimization with visual or visual-inertial navigation;
and loop closing and map fusion, which detect common areas between the dynamic active map and the mixed map set; if the common area lies within the active map, a loop-closing process is executed and the active map is then optimized by a global BA, and if the common area does not belong to the active map, the active map and the matching map in the mixed map set are fused into one map.
8. The method of claim 7, wherein the ORB-SLAM 3-based system estimates the parameters to initialize and optimize the IMU using methods in local mapping with inertial navigation.
9. The method of claim 7, wherein the BA optimization step comprises:
for the matched corresponding pixel coordinates of a first image and a second image on the active map, calculating the normalized spatial point coordinates corresponding to the pixel coordinates on the first image;
and calculating the pixel coordinates re-projected onto the second image from the spatial point coordinates, and, if the re-projected pixel coordinates do not coincide exactly with the matched pixel coordinates on the second image, establishing an equation for each pair of matched pixel coordinates, combining the equations simultaneously into an overdetermined system, and solving for the optimal pose matrix or spatial point coordinates.
10. The method of any of claims 1 to 9, wherein the semantic segmentation model is improved based on the DeepLab model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011224040.6A CN112365604B (en) | 2020-11-05 | 2020-11-05 | AR equipment depth information application method based on semantic segmentation and SLAM |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011224040.6A CN112365604B (en) | 2020-11-05 | 2020-11-05 | AR equipment depth information application method based on semantic segmentation and SLAM |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112365604A true CN112365604A (en) | 2021-02-12 |
CN112365604B CN112365604B (en) | 2024-08-23 |
Family
ID=74508734
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011224040.6A Active CN112365604B (en) | 2020-11-05 | 2020-11-05 | AR equipment depth information application method based on semantic segmentation and SLAM |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112365604B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113409231A (en) * | 2021-06-10 | 2021-09-17 | 杭州易现先进科技有限公司 | AR portrait photographing method and system based on deep learning |
CN113409331A (en) * | 2021-06-08 | 2021-09-17 | Oppo广东移动通信有限公司 | Image processing method, image processing apparatus, terminal, and readable storage medium |
CN113537171A (en) * | 2021-09-16 | 2021-10-22 | 北京易航远智科技有限公司 | Dividing method of SLAM map |
CN113643357A (en) * | 2021-07-12 | 2021-11-12 | 杭州易现先进科技有限公司 | AR portrait photographing method and system based on 3D positioning information |
CN113781363A (en) * | 2021-09-29 | 2021-12-10 | 北京航空航天大学 | Image enhancement method with adjustable defogging effect |
CN113963000A (en) * | 2021-10-21 | 2022-01-21 | 北京字节跳动网络技术有限公司 | Image segmentation method, device, electronic equipment and program product |
CN114863165A (en) * | 2022-04-12 | 2022-08-05 | 南通大学 | Vertebral body bone density classification method based on fusion of image omics and deep learning features |
CN115294488A (en) * | 2022-10-10 | 2022-11-04 | 江西财经大学 | AR rapid object matching display method |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180053056A1 (en) * | 2016-08-22 | 2018-02-22 | Magic Leap, Inc. | Augmented reality display device with deep learning sensors |
US20190051056A1 (en) * | 2017-08-11 | 2019-02-14 | Sri International | Augmenting reality using semantic segmentation |
CN109583457A (en) * | 2018-12-03 | 2019-04-05 | 荆门博谦信息科技有限公司 | A kind of method and robot of robot localization and map structuring |
CN110827305A (en) * | 2019-10-30 | 2020-02-21 | 中山大学 | Semantic segmentation and visual SLAM tight coupling method oriented to dynamic environment |
Non-Patent Citations (1)
Title |
---|
李宾皑; 李颖; 郝鸣阳; 顾书玉: "弱监督学习语义分割方法综述" [A Survey of Weakly Supervised Semantic Segmentation Methods], 数字通信世界 [Digital Communication World], no. 07, 1 July 2020 (2020-07-01) *
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113409331A (en) * | 2021-06-08 | 2021-09-17 | Oppo广东移动通信有限公司 | Image processing method, image processing apparatus, terminal, and readable storage medium |
CN113409331B (en) * | 2021-06-08 | 2024-04-12 | Oppo广东移动通信有限公司 | Image processing method, image processing device, terminal and readable storage medium |
CN113409231A (en) * | 2021-06-10 | 2021-09-17 | 杭州易现先进科技有限公司 | AR portrait photographing method and system based on deep learning |
CN113643357A (en) * | 2021-07-12 | 2021-11-12 | 杭州易现先进科技有限公司 | AR portrait photographing method and system based on 3D positioning information |
CN113537171A (en) * | 2021-09-16 | 2021-10-22 | 北京易航远智科技有限公司 | Dividing method of SLAM map |
CN113781363B (en) * | 2021-09-29 | 2024-03-05 | 北京航空航天大学 | Image enhancement method with adjustable defogging effect |
CN113781363A (en) * | 2021-09-29 | 2021-12-10 | 北京航空航天大学 | Image enhancement method with adjustable defogging effect |
CN113963000A (en) * | 2021-10-21 | 2022-01-21 | 北京字节跳动网络技术有限公司 | Image segmentation method, device, electronic equipment and program product |
CN113963000B (en) * | 2021-10-21 | 2024-03-15 | 抖音视界有限公司 | Image segmentation method, device, electronic equipment and program product |
CN114863165A (en) * | 2022-04-12 | 2022-08-05 | 南通大学 | Vertebral body bone density classification method based on fusion of image omics and deep learning features |
CN114863165B (en) * | 2022-04-12 | 2023-06-16 | 南通大学 | Vertebral bone density classification method based on fusion of image histology and deep learning features |
CN115294488B (en) * | 2022-10-10 | 2023-01-24 | 江西财经大学 | AR rapid object matching display method |
CN115294488A (en) * | 2022-10-10 | 2022-11-04 | 江西财经大学 | AR rapid object matching display method |
Also Published As
Publication number | Publication date |
---|---|
CN112365604B (en) | 2024-08-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112365604B (en) | AR equipment depth information application method based on semantic segmentation and SLAM | |
Sahu et al. | Artificial intelligence (AI) in augmented reality (AR)-assisted manufacturing applications: a review | |
Moreau et al. | Lens: Localization enhanced by nerf synthesis | |
US11238606B2 (en) | Method and system for performing simultaneous localization and mapping using convolutional image transformation | |
JP7151016B2 (en) | A Deep Machine Learning System for Cuboid Detection | |
CN112771539B (en) | Employing three-dimensional data predicted from two-dimensional images using neural networks for 3D modeling applications | |
US11094137B2 (en) | Employing three-dimensional (3D) data predicted from two-dimensional (2D) images using neural networks for 3D modeling applications and other applications | |
US20180012411A1 (en) | Augmented Reality Methods and Devices | |
WO2022165809A1 (en) | Method and apparatus for training deep learning model | |
Kumar et al. | Monocular fisheye camera depth estimation using sparse lidar supervision | |
Won et al. | Sweepnet: Wide-baseline omnidirectional depth estimation | |
CN108230240A (en) | It is a kind of that the method for position and posture in image city scope is obtained based on deep learning | |
CN113674416B (en) | Three-dimensional map construction method and device, electronic equipment and storage medium | |
CN114972617B (en) | Scene illumination and reflection modeling method based on conductive rendering | |
KR20220081261A (en) | Method and apparatus for object pose estimation | |
US11948310B2 (en) | Systems and methods for jointly training a machine-learning-based monocular optical flow, depth, and scene flow estimator | |
CN113256699B (en) | Image processing method, image processing device, computer equipment and storage medium | |
CN116194951A (en) | Method and apparatus for stereoscopic based 3D object detection and segmentation | |
Jia et al. | Depth measurement based on a convolutional neural network and structured light | |
CN116883961A (en) | Target perception method and device | |
Bai et al. | Cyber mobility mirror for enabling cooperative driving automation: A co-simulation platform | |
US20220180548A1 (en) | Method and apparatus with object pose estimation | |
US12002227B1 (en) | Deep partial point cloud registration of objects | |
Wang et al. | Research on 3D Sampling and Monitoring of Power Supplies Based on Augmented Reality (AR) Technology | |
Liang et al. | Semantic map construction based on LIDAR and vision fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |