WO2023184278A1 - Method for semantic map building, server, terminal device and storage medium - Google Patents

Method for semantic map building, server, terminal device and storage medium

Info

Publication number
WO2023184278A1
Authority
WO
WIPO (PCT)
Prior art keywords
image data
data
information
server
voxel
Prior art date
Application number
PCT/CN2022/084205
Other languages
French (fr)
Inventor
Yun-Jou Lin
Dawei ZHONG
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp., Ltd. filed Critical Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority to PCT/CN2022/084205 priority Critical patent/WO2023184278A1/en
Publication of WO2023184278A1 publication Critical patent/WO2023184278A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G06T2207/10024 Color image
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G06T2207/20 Special algorithmic details
    • G06T2207/20076 Probabilistic image processing
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30244 Camera pose
    • G06T2210/00 Indexing scheme for image generation or computer graphics
    • G06T2210/56 Particle system, point based geometry or rendering

Definitions

  • Embodiments of the disclosure relate to the field of computer vision technology, and more particularly to a method for semantic map building, a server, a terminal device and a storage medium.
  • Augmented Reality (AR) technology is a new technology that seamlessly combines information of the real world with information of the virtual world. Through it, virtual information can be applied to the real world and perceived by human senses, so that people experience an "immersive" reality.
  • the disclosure provides a method for semantic map building, a server, a terminal device and a storage medium.
  • the embodiments of the disclosure provide a method for semantic map building, which is applied to a server, and may include the following operations.
  • First image data and pose data sent by a terminal device are received.
  • a 3D grid model is generated according to the first image data and the pose data.
  • Semantic segmentation is performed on the 3D grid model to obtain a target semantic map, the target semantic map being used for displaying a virtual object in a physical environment.
  • The embodiments of the disclosure provide a method for semantic map building, which is applied to a terminal device, and may include the following operations. Collected data of a wearable device is obtained, the collected data including first image data and second image data. The second image data is processed to generate pose data. The first image data and the pose data are sent to a server, the first image data and the pose data being used for the server to build a target semantic map.
  • the embodiments of the disclosure provide a server, which may include: a first receiving unit, a modeling unit and a segmenting unit.
  • the first receiving unit is configured to receive the first image data and the pose data sent by the terminal device.
  • the modeling unit is configured to generate the 3D grid model according to the first image data and the pose data.
  • the segmenting unit is configured to perform semantic segmentation on the 3D grid model to obtain the target semantic map, the target semantic map being used for displaying the virtual object in the physical environment.
  • the embodiments of the disclosure provide a server, which may include: a first memory and a first processor.
  • the first memory is configured to store a computer program capable of running on the first processor.
  • the first processor is configured to execute, when running the computer program, the method in the first aspect.
  • the embodiments of the disclosure provide a terminal device, which may include: an obtaining unit, a data processing unit and a second sending unit.
  • the obtaining unit is configured to obtain the collected data of the wearable device, the collected data including the first image data and the second image data.
  • the data processing unit is configured to process the second image data to generate the pose data.
  • the second sending unit is configured to send the first image data and the pose data to the server, the first image data and the pose data being used for the server to build the target semantic map.
  • the embodiments of the disclosure provide a terminal device, which may include: a second memory and a second processor.
  • the second memory is configured to store a computer program capable of running on the second processor.
  • the second processor is configured to execute, when running the computer program, the method in the second aspect.
  • the embodiments of the disclosure provide a computer storage medium, in which a computer program is stored.
  • the computer program implements the method in the first aspect when executed by the first processor, or implements the method in the second aspect when executed by the second processor.
  • the embodiments of the disclosure provide a method for semantic map building, a server, a terminal device and a storage medium.
  • collected data of a wearable device is obtained, the collected data including first image data and second image data; the second image data is processed to generate pose data; the first image data and the pose data are sent to a server, so that the server builds a target semantic map.
  • the first image data and the pose data sent by the terminal device are received; a 3D grid model is generated according to the first image data and the pose data; and semantic segmentation is performed on the 3D grid model to obtain a target semantic map, the target semantic map being used for displaying a virtual object in a physical environment.
  • the pose data is also used in building of the target semantic map, thus a more complete scene reconstruction can be achieved, the accuracy of the whole semantic map is improved, and a virtual object can be displayed on a specific physical object better. Additionally, since the scene reconstruction and the semantic segmentation are performed in the server, it can reduce memory usage and power consumption of the terminal device.
  • FIG. 1 is a schematic diagram of the composition of a vision enhancement system according to an embodiment of the disclosure.
  • FIG. 2 is a flowchart of a method for semantic map building according to an embodiment of the disclosure.
  • FIG. 3 is a structural schematic diagram of AR glasses according to an embodiment of the disclosure.
  • FIG. 4 is a schematic diagram of a marching cube according to an embodiment of the disclosure.
  • FIG. 5 is an architecture diagram of semantic segmentation and instance segmentation according to an embodiment of the disclosure.
  • FIG. 6 is a flowchart of another method for semantic map building according to an embodiment of the disclosure.
  • FIG. 7 is an application diagram of an AR application according to an embodiment of the disclosure.
  • FIG. 8 is a detailed flowchart of a method for semantic map building according to an embodiment of the disclosure.
  • FIG. 9 is a composition structure diagram of a server according to an embodiment of the disclosure.
  • FIG. 10 is a structure diagram of specific hardware of a server according to an embodiment of the disclosure.
  • FIG. 11 is a composition structure diagram of a terminal device according to an embodiment of the disclosure.
  • FIG. 12 is a structure diagram of specific hardware of a terminal device according to an embodiment of the disclosure.
  • With AR devices such as Magic Leap One and HoloLens, and development kits such as ARCore and ARKit, AR technology has become ubiquitous. Most AR devices and development kits include SLAM technology that allows a user to generate maps or grids and localizes the user.
  • AR is a technology that combines the virtual world with the real world and allows people to interact with the virtual world in real time.
  • The key technologies of AR are SLAM and reconstruction of the real world. The 6-DoF pose and depth images generated based on SLAM enable a more complete scene reconstruction.
  • semantic information is needed for scene understanding in a 3D reconstructed world, and RGBD data from the AR device may be used for 3D world perception.
  • Compared with the server, the AR device (for example, a terminal device with AR glasses) has low computing power and limited memory, so the AR device may not be able to process all data in real time. Therefore, it is necessary to distribute simultaneous localization, scene reconstruction and 3D semantic segmentation between the AR device and the server.
  • A communication connection between the AR device and the server is established through the Transmission Control Protocol (TCP).
  • a user may access real-time 6-DoF pose data and RGBD data from the AR device, and send these data to a server for scene reconstruction and 3D semantic segmentation.
  • the user may interact with a physical environment in a more intelligent way after the 3D semantic segmentation, for example, a virtual object is displayed on a specific physical object.
  • FIG. 1 shows a schematic diagram of the composition of a vision enhancement system according to an embodiment of the disclosure.
  • a vision enhancement system 10 may include a wearable device 110, a terminal device 120 and a server 130.
  • a communication connection between the wearable device 110 and the terminal device 120 is established through a physical cable, and a communication connection between the terminal device 120 and the server 130 is established through the TCP.
  • the wearable device 110 may specifically refer to a monocular or binocular Head-Mounted Display (HMD) , for example, the AR glasses.
  • the wearable device 110 may include one or more display modules placed near the user's one or both eyes. Through the display module of the wearable device 110, display content therein may be presented in front of the user's eyes, and the display content can fill or partially fill the user's field of vision.
  • the display module may refer to one or more Organic Light-Emitting Diode (OLED) modules, Liquid Crystal Display (LCD) modules, laser display modules, etc.
  • the wearable device 110 may further include one or more sensors and one or more cameras.
  • the wearable device 110 may further include one or more sensors, such as an Inertial Measurement Unit (IMU) , an accelerometer, a gyroscope, a proximity sensor, and a depth camera, so as to obtain collected data.
  • the terminal device 120 may be implemented in a variety of forms.
  • the terminal devices described in the embodiments of the disclosure may include a smartphone, a tablet personal computer, a notebook computer, a laptop computer, a palm computer, a Personal Digital Assistant (PDA) , a smartwatch, etc.
  • the server 130 may be a cloud server, a network server, etc.
  • the server 130 may be regarded as a computer for managing computing resources and can provide computing or application services for the terminal device 120.
  • the server 130 has high-speed CPU computing power, long duration of reliable operation, strong data handling capacity and better scalability.
  • FIG. 2 shows a flowchart of a method for semantic map building according to an embodiment of the disclosure. As shown in FIG. 2, the method may include the following steps.
  • a terminal device obtains collected data of a wearable device, the collected data including first image data and second image data.
  • the embodiments of the disclosure are applied to the vision enhancement system including the wearable device, the terminal device and the server.
  • the wearable device and the terminal device are in a wired connection through the physical cable to realize information interaction; and a wireless connection is established between the terminal device and the server through the TCP to realize information interaction.
  • For the terminal device, it first needs to obtain the collected data of the wearable device.
  • the collected data may include the first image data and the second image data.
  • the first image data is used for scene reconstruction and semantic segmentation
  • the second image data is used for estimation of pose data.
  • the terminal device processes the second image data to generate pose data.
  • the terminal device is equipped with an SLAM system.
  • the SLAM system may use the second image data to estimate a camera pose.
  • the second image data may include fisheye image data and inertial sensor data. Therefore, in some embodiments, the operation of processing the second image data to generate the pose data may include that, pose calculation is performed on the fisheye image data and the inertial sensor data by using the SLAM system, so as to generate the pose data.
  • the SLAM system may use the fisheye image data and the inertial sensor data to generate reliable pose data on the terminal device.
  • the 6-DoF pose data may be generated by using the fisheye image data and the inertial sensor data.
  • the six degrees of freedom may include degrees of freedom of movement along rectangular coordinate axes x, y, z and degrees of freedom of rotation around the three axes, so that location information can be completely determined, and then it can be better used for scene reconstruction and semantic segmentation.
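  • As an illustration only (not part of the disclosure), the following minimal Python sketch shows how a 6-DoF pose can be packed into a 4x4 homogeneous transform of the kind a SLAM system typically outputs; the use of numpy and scipy and the Euler-angle convention are assumptions of this sketch.

    import numpy as np
    from scipy.spatial.transform import Rotation

    def pose_to_matrix(tx, ty, tz, roll, pitch, yaw):
        """Pack a 6-DoF pose (translation along x/y/z plus rotation about the
        three axes, in radians) into a 4x4 homogeneous transform matrix."""
        T = np.eye(4)
        T[:3, :3] = Rotation.from_euler("xyz", [roll, pitch, yaw]).as_matrix()
        T[:3, 3] = [tx, ty, tz]
        return T

    # Example: camera 1.2 m forward, 0.5 m up, rotated 30 degrees about the y axis.
    print(pose_to_matrix(1.2, 0.0, 0.5, 0.0, np.deg2rad(30.0), 0.0))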
  • the first image data may refer to RGBD image data, and may specifically include depth image data and RGB image data for scene reconstruction and semantic segmentation; and the second image data may include the fisheye image data and the inertial sensor data for estimation of the pose data.
  • an inertial sensor may also be called IMU, that is, the inertial sensor data may also be called IMU data.
  • the operation of obtaining the collected data of the wearable device may include the following actions.
  • Depth image data and time stamp information corresponding to the depth image data are obtained through a first thread.
  • RGB image data and time stamp information corresponding to the RGB image data are obtained through a second thread.
  • Fisheye image data and time stamp information corresponding to the fisheye image data are obtained through a third thread.
  • Inertial sensor data and time stamp information corresponding to the inertial sensor data are obtained through a fourth thread.
  • the time stamp information is used for measuring whether the depth image data, the RGB image data, the fisheye image data and the inertial sensor data are synchronous in time.
  • data may be acquired through the AR glasses, thus the depth image data, the RGB image data, the fisheye image data and the IMU data can be acquired.
  • the depth image data is obtained by a depth camera
  • the RGB image data is obtained by an RGB camera
  • the fisheye image data is obtained by a wide-angle fisheye camera
  • the IMU data is obtained by the IMU.
  • the depth image data and the RGB image data may be used for scene reconstruction and semantic segmentation
  • the fisheye image data and the IMU data may be used for estimation of camera pose
  • all the data may be sent to the terminal device through the physical cable for further processing.
  • the SLAM system uses the fisheye image data and the inertial sensor data to generate reliable pose data on the terminal device.
  • Since a fisheye image has a larger field of view than an RGB image, more reliable pose data can be obtained in this way.
  • the estimation of camera pose using visual inertial SLAM in the embodiments of the disclosure has better robustness and accuracy.
  • the terminal device may obtain the depth image data, the RGB image data, the fisheye image data, the inertial sensor data and their corresponding time stamp information from the wearable device through different threads.
  • The time stamp information here is used for obtaining synchronous data. Since the data streams sent by the AR glasses may be out of sync, the time stamp information is needed to select data at similar time points and synchronize them.
  • the method may further include that: time synchronization is performed on the depth image data, the RGB image data and the pose data through a time synchronization program.
  • time synchronization may include the following operations.
  • First time stamp information corresponding to the pose data is determined.
  • Depth image data in synchronization with the first time stamp information is selected from the depth image data and time stamp information corresponding to the depth image data.
  • RGB image data in synchronization with the first time stamp information is selected from the RGB image data and time stamp information corresponding to the RGB image data.
  • First image data is determined according to the selected depth image data and the selected RGB image data.
  • the first image data and the pose data are synchronous in time.
  • the pose data and the first time stamp information may be obtained according to the fisheye image data, the inertial sensor data, and their corresponding time stamp information; then, the depth image data and the RGB image data which are synchronous in time may be selected according to the first time stamp information; at this time, the obtained first image data and pose data are synchronous in time.
  • The terminal device may be provided with the time synchronization program, so that the time synchronization program runs on the terminal device and only synchronous data is allowed to be sent to the server.
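  • For illustration, the selection of synchronous frames can be sketched as follows in Python; the tuple-based frame lists and the tolerance value are assumptions of this sketch and do not appear in the disclosure.

    def pick_nearest(frames, target_ts, tolerance):
        """Return the (timestamp, payload) frame closest to target_ts,
        or None if no frame falls within the tolerance window."""
        best = min(frames, key=lambda f: abs(f[0] - target_ts), default=None)
        if best is not None and abs(best[0] - target_ts) <= tolerance:
            return best
        return None

    def synchronize(pose_ts, depth_frames, rgb_frames, tolerance=0.015):
        """Select the depth and RGB frames that are synchronous (within
        `tolerance` seconds) with the time stamp of the pose data."""
        depth = pick_nearest(depth_frames, pose_ts, tolerance)
        rgb = pick_nearest(rgb_frames, pose_ts, tolerance)
        if depth is None or rgb is None:
            return None   # no synchronous set; nothing is sent to the server
        return {"pose_ts": pose_ts, "depth": depth[1], "rgb": rgb[1]}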
  • the terminal device sends the first image data and the pose data to the server.
  • the first image data and the pose data are sent to the server by the terminal device only in case of time synchronization between them, so that the server builds a target semantic map.
  • the terminal device and the server are in a wireless connection, specifically through the TCP, to realize information interaction. Therefore, in some embodiments, the operation of sending the first image data and the pose data to the server may include the following: a TCP connection is established with the server, and the first image data and the pose data are sent to the server based on the TCP connection.
  • The TCP connection needs to be established between the terminal device and the server, and then the terminal device may send data to the server or receive data from the server based on the TCP connection. Only when the first image data and the pose data are synchronous are they allowed to be sent to the server; in this way, data can be sent efficiently and the bandwidth of TCP communication can be reduced.
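  • As a rough sketch of the TCP transmission (the length-prefixed, pickled framing and the server address are assumptions made only for this illustration), a synchronous packet of image data and pose data might be sent as follows.

    import pickle
    import socket
    import struct

    def send_frame(sock, rgb, depth, pose):
        """Send one synchronized packet (RGB image, depth image, 6-DoF pose)
        over an established TCP connection, prefixed with its byte length."""
        payload = pickle.dumps({"rgb": rgb, "depth": depth, "pose": pose})
        sock.sendall(struct.pack("!I", len(payload)) + payload)

    # Usage (the server address below is a placeholder):
    # sock = socket.create_connection(("server.example.com", 9000))
    # send_frame(sock, rgb_image, depth_image, pose_matrix)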
  • the server generates a 3D grid model according to the first image data and the pose data.
  • the server needs to receive the first image data and the pose data sent by the terminal device, which may specifically include that: the server establishes the TCP connection with the terminal device; and the server receives the first image data and the pose data sent by the terminal device based on the TCP connection.
  • the server may receive the first image data and the pose data sent by the terminal device.
  • the first image data may include the depth image data and the RGB image data, and the depth image data, the RGB image data and the pose data are synchronous in time, which can reduce the bandwidth of TCP communication.
  • the operation of generating the 3D grid model according to the first image data and the pose data may include that, local point clouds corresponding to different observation points of the depth camera are determined based on the first image data and the pose data; fusion calculation is performed on the local point clouds corresponding to different observation points to determine a fusion value of at least one voxel in 3D space; and the 3D grid model is built based on the fusion value of at least one voxel.
  • the method may further include that, a geometry is built, and the geometry is voxelized to obtain the 3D space including at least one voxel.
  • the geometry may specifically refer to a cuboid bounding box, which may completely surround an object to be reconstructed. Then, the cuboid bounding box is voxelized, so that the 3D space including at least one voxel can be obtained.
  • fusion calculation is performed on the local point clouds corresponding to different observation points to determine the fusion value of at least one voxel in the 3D space may include that, fusion calculation is performed on the local point clouds corresponding to different observation points by using a Truncated Signed Distance Function (TSDF) algorithm based on the 3D space, to obtain a TSDF value of at least one voxel.
  • TSDF Truncated Signed Distance Function
  • the operation of building the 3D grid model based on the fusion value of at least one voxel may include that: the 3D grid model is built according to the TSDF value of at least one voxel.
  • the local point clouds corresponding to different observation points may be fused into a TSDF algorithm model to obtain the TSDF value of at least one voxel in the 3D space; then, the 3D grid model is built according to the TSDF value of at least one voxel in the 3D space.
  • the TSDF value of at least one voxel in the 3D space may be calculated using the TSDF algorithm.
  • The method may further include that: if the TSDF value is a positive value, it is determined that the voxel is between the depth camera and the surface of the object; and if the TSDF value is a negative value, it is determined that the voxel lies beyond the line between the depth camera and the surface of the object.
  • A 3D grid model is obtained by fusion through the TSDF algorithm model.
  • the local point clouds may be obtained by observing at different locations through the depth camera.
  • the local point clouds observed by the depth camera from different angles need to be fused.
  • the TSDF model is an effective method for obtaining a reconstructed surface.
  • the distance from the voxel center (namely the center coordinates of the voxel) to the nearest surface may be expressed by the signed distance function TSDF.
  • The operation of performing fusion calculation on the local point clouds corresponding to different observation points by using the TSDF algorithm to obtain the TSDF value of at least one voxel may include the following. A first TSDF value and a first weight value obtained at the current observation point are determined based on a first voxel, and a fusion TSDF value and a fusion weight value obtained after the fusion with the previous observation point are obtained. Weighted average calculation is performed by using the TSDF algorithm according to the fusion TSDF value, the fusion weight value, the first TSDF value and the first weight value, to obtain a second TSDF value and a second weight value corresponding to the first voxel. The fusion TSDF value and the fusion weight value are updated according to the second TSDF value and the second weight value, so as to fuse the local point cloud corresponding to the current observation point into the TSDF model, the first voxel being any one of at least one voxel in the 3D space.
  • a light beam from the depth camera passes through the voxel at a specific location in space.
  • The distance from the voxel center to the surface of the object may be approximated as the distance from the voxel center to the observed surface point along the light direction.
  • This distance from the voxel to the observed surface point is the TSDF value of the voxel. If the voxel is between the depth camera and the observed surface point, the TSDF value is positive; otherwise the TSDF value is negative. In this way, depth maps observed at different locations are fused.
  • The TSDF value is updated using a weighted average method as follows:

    D_{i+1}(v) = ( W_i(v) * D_i(v) + w(v) * d(v) ) / ( W_i(v) + w(v) )
    W_{i+1}(v) = W_i(v) + w(v)

    where w(v) represents a first weight value obtained at the current observation point, which may also be called the confidence; d(v) represents a first TSDF value obtained at the current observation point, namely the observed quantity, that is, the distance from the voxel center to the surface of the object calculated according to the depth observed by the depth camera; D_i(v) represents the TSDF value of the i-th point, namely the fusion TSDF value; W_i(v) represents the weight value of the i-th point, namely the fusion weight value; and w(v) is a constant, usually with the value 1.
  • the updated TSDF value here mainly refers to geometric information of the voxel.
  • the fused RGB color may also be obtained by the weighted average method.
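  • The weighted-average update can be written compactly in code. The sketch below assumes the TSDF volume and the per-voxel weights are stored in numpy arrays and that a fixed truncation distance is applied; it only illustrates the running-average rule and is not the implementation of the disclosure.

    import numpy as np

    def update_tsdf(D, W, d, w=1.0, trunc=0.05):
        """Fuse one new observation into the TSDF volume.

        D, W  -- fused TSDF values and fused weights, one entry per voxel
        d     -- newly observed signed distances (positive in front of the
                 surface, negative behind it), same shape as D
        w     -- confidence of the new observation (a constant, usually 1)
        trunc -- truncation distance limiting the stored signed distance
        """
        d = np.clip(d, -trunc, trunc)          # truncate the signed distance
        D_new = (W * D + w * d) / (W + w)      # weighted running average
        W_new = W + w
        return D_new, W_new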
  • In a mapping thread, when new RGB image data and depth image data arrive, the fusion process of the TSDF model is continued.
  • After obtaining the TSDF model, we need to extract information of the surface of the object from the TSDF model.
  • the surface of the object may usually be represented by many interconnected triangles.
  • The triangles of the surface of the object are extracted from the TSDF model through the marching cubes method. The specific steps are as follows.
  • FIG. 4 shows a schematic diagram of a marching cube according to an embodiment of the disclosure.
  • the cube provided in graph (a) may be composed of eight adjacent voxel centers in the space.
  • A stored value of each voxel center is the distance from that point to the nearest surface. If the surface of the object passes through the cube, the TSDF values stored at the vertexes of the cube on the two sides of the surface must have different signs. By calculating the product of the TSDF values of the two endpoints of each edge of the cube, it is possible to determine whether the signs differ and thus whether the surface passes through that edge.
  • The location where the surface passes through an edge of the cube may be determined from the TSDF values of the vertexes of the cube.
  • Triangles may then be connected to form a complete object surface, as shown in graph (b) in FIG. 4, and the triangles are extracted from the vertexes found on the surface of the object.
  • the TSDF is a common method for calculating the surface of the object in 3D reconstruction, which may use the TSDF to build space voxels, obtain the TSDF value of each voxel, and then extract the surface of the object using the above-mentioned marching cube method.
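  • The core test of the marching cubes step, namely deciding whether the surface crosses an edge of the cube and where, can be sketched as follows; the linear interpolation of the zero crossing is an assumption of this illustration.

    def edge_crossing(p0, p1, tsdf0, tsdf1):
        """If the TSDF values at the two endpoints of a cube edge have different
        signs, the surface passes through the edge; return the (linearly
        interpolated) crossing point, otherwise None."""
        if tsdf0 * tsdf1 >= 0.0:        # same sign: the surface does not cross
            return None
        t = tsdf0 / (tsdf0 - tsdf1)     # fraction along the edge where TSDF = 0
        return [p0[i] + t * (p1[i] - p0[i]) for i in range(3)]

    # Example: the surface crosses this edge 25% of the way from p0 to p1.
    print(edge_crossing([0, 0, 0], [1, 0, 0], -0.01, 0.03))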
  • A large space (which may be called a volume) is taken as the 3D model to be built, and the space may completely include the object model to be reconstructed.
  • the volume is composed of many small voxels (namely small cubes) , and each voxel corresponds to a point in the space, which mainly involves two parameters: a distance value from the voxel to the nearest surface and a weight value when the voxel is updated.
  • The cuboid bounding box which can completely surround the object to be reconstructed needs to be built first; then, the cuboid bounding box is voxelized, that is, the cuboid bounding box is divided into n equal parts, and the size of each voxel depends on the size of the bounding box and the number of divided voxels; then each voxel is translated into a 3D position point in the world coordinate system.
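  • A minimal numpy-based sketch of such a voxelization is given below; the array layout and the choice of voxel centers are assumptions of this illustration, not the implementation of the disclosure.

    import numpy as np

    def voxelize(box_min, box_max, n):
        """Divide a cuboid bounding box into n voxels along each axis and return
        the voxel size together with the world coordinates of every voxel
        center, shaped (n, n, n, 3)."""
        box_min = np.asarray(box_min, dtype=float)
        box_max = np.asarray(box_max, dtype=float)
        voxel_size = (box_max - box_min) / n
        idx = np.stack(np.meshgrid(*[np.arange(n)] * 3, indexing="ij"), axis=-1)
        centers = box_min + (idx + 0.5) * voxel_size
        return voxel_size, centers

    # Example: a 2 m x 2 m x 1 m bounding box divided into 64 x 64 x 64 voxels.
    size, centers = voxelize([0, 0, 0], [2.0, 2.0, 1.0], n=64)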
  • the server performs semantic segmentation on the 3D grid model to obtain a target semantic map.
  • the server may continue to perform semantic segmentation on the 3D grid model, so as to obtain the target semantic map.
  • the target semantic map is used for displaying a virtual object in a physical environment.
  • the operation of performing semantic segmentation on the 3D grid model to obtain the target semantic map may include that, the 3D grid model is input to a neural network structure; and in the neural network structure, semantic segmentation and instance segmentation are performed by way of point-wise feature learning to obtain the target semantic map.
  • the operation of performing semantic segmentation and instance segmentation by way of point-wise feature learning to obtain the target semantic map may include that, a 3D sparse convolution operation is performed on the 3D grid model to determine semantic information, feature embedding information, spatial embedding information and occupancy information; supervoxel grouping is performed on the 3D grid model by using an image segmentation algorithm, so as to obtain supervoxel information;
  • covariance estimation is performed according to the feature embedding information and the spatial embedding information to obtain target embedding information; and clustering operation is performed according to the target embedding information, the occupancy information and the supervoxel information to determine instance information; the target semantic map is obtained according to the semantic information and the instance information.
  • the neural network structure may be a 3D UNet structure, or also be another neural network structure, which is not limited in the embodiments of the disclosure.
  • object detection needs to provide not only the class of an object in the image, but also the location of the object (bounding box) .
  • the semantic segmentation needs to predict the class label, to which each pixel of an input image belongs.
  • The instance segmentation also needs to distinguish different individuals in the same class based on the semantic segmentation. For example, after the semantic segmentation, it may be determined that three people all belong to the label of people, but each person may be regarded as one instance.
  • the point clouds may be labeled to distinguish different instances.
  • FIG. 5 shows an architecture diagram of semantic segmentation and instance segmentation according to an embodiment of the disclosure.
  • an RGB feature is taken as input, and the 3D UNet structure is used for point-wise feature learning.
  • The learned feature is decoded into various representations through a fully connected layer, which may be used for 3D instance segmentation.
  • As shown in FIG. 5, the representations include the semantic information, the feature embedding information, the spatial embedding information and the occupancy information. The semantic information aims to assign the class labels; the feature embedding information and the spatial embedding information aim to fuse feature and spatial information; and the occupancy information aims to indicate the probability that there is actually an object in the input voxel. Then, the covariance estimation is performed on the feature embedding information and the spatial embedding information, which aims to learn an embedding vector that considers both feature and spatial embedding and is used for the instance segmentation. After the combination of the feature and spatial embedding is obtained, it is also weighted and fused with the occupancy information obtained in advance, so as to determine a weight value (expressed by w_{i,j}); the larger w_{i,j}, the more likely i and j belong to the same instance.
  • The edge with the maximum w_{i,j} is selected, which means that i and j are most likely the same instance.
  • A threshold T is set; if the weight value w_{i,j} is greater than T, the two nodes may be merged. After that, the graph is updated continuously until there is no edge whose weight value is greater than T, at which point the final result may be output.
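  • The greedy merging described above can be sketched with a simple union-find structure; the edge-list layout is an assumption of this illustration, not the data structure of the disclosure.

    def cluster_supervoxels(edges, threshold):
        """Greedily merge supervoxel nodes into instances.

        edges     -- list of (weight, node_i, node_j); a larger weight means
                     i and j are more likely to belong to the same instance
        threshold -- the threshold T; only edges with weight > T are merged
        """
        parent = {}

        def find(x):                        # union-find with path halving
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        for _, i, j in edges:               # register every node as its own root
            find(i), find(j)
        for w, i, j in sorted(edges, reverse=True):   # largest weight first
            if w <= threshold:
                break                       # no remaining edge exceeds T
            parent[find(i)] = find(j)       # merge the two nodes into one instance

        return {node: find(node) for node in parent}  # node -> instance label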
  • 3D occupancy information may be used to represent the number of voxels occupied by each instance.
  • occupancy information represents the inherent and essential properties of each 3D instance.
  • the occupancy information is encoded into a traditional 3D instance segmentation pipeline.
  • a learning phase and a clustering phase make full use of the characteristics of the occupancy information.
  • a color 3D scene may be taken as input, and a hybrid vector of each voxel is extracted using a spatial sparse convolution method.
  • the learning phase not only learns the classical embedding (including the spatial embedding and the feature embedding) , but also generates a piece of occupancy information, which implies the volume of an object level.
  • the feature embedding and the spatial embedding are explicitly supervised with different objectives, and further combined through the covariance estimation for both a feature embedding distance and a spatial embedding distance.
  • the 3D input point cloud is grouped into super-voxels based on the geometric and appearance constraints using a graph-based segmentation algorithm. Then, to merge the super-voxels with similar feature embedding into the same instance, an adaptive threshold is utilized to evaluate the similarity between the embedding distance and the occupancy size.
  • Aided by the reliable comparison between the predicted occupancy size and the clustered occupancy size, the clustering encourages hard samples to be correctly clustered and eliminates false positives where partial instances are recognized as an independent instance.
  • the TSDF can only provide the 3D grid model.
  • the above content describes how to use the 3D grid model generated by the TSDF to obtain the semantic information and the instance information.
  • the neural network structure here may be divided into two parts, namely the semantic information and the instance information.
  • The semantic information is, for example, "table" or "chair".
  • The instance information distinguishes, for example, two separate objects. After the semantic segmentation and the instance segmentation, information of two chairs and one table may be obtained.
  • a piece of occupancy information is also provided here, which is defined as the number of voxels occupied by each instance. Based on this, a solution of 3D instance segmentation based on occupancy perception is proposed.
  • the occupancy information indicates the probability that there is actually an object in a voxel.
  • the architecture shown in FIG. 5 not only uses the spatial embedding and the feature embedding, but also considers the occupancy information, so the instance can be more accurately segmented, and then the target semantic map can be rebuilt.
  • a potential AR application of the semantic information may also be used, for example, a virtual object is moved to a corresponding position in which the user is interested and displayed. Therefore, in some embodiments, as shown in FIG. 6, after S205, the method may further include the following steps.
  • the server obtains target coordinate information of an object in which the user is interested.
  • the server sends the target coordinate information to the terminal device.
  • the terminal device generates a rendered image according to the target coordinate information and the virtual object, and sends the rendered image to the wearable device, so that the virtual object is displayed at a position corresponding to the target coordinate information through the wearable device.
  • the server may obtain the target coordinate information of the object in which the user is interested from the target semantic map, and then sends the target coordinate information to the terminal device, so that the terminal device generates and displays the rendered image including the virtual object.
  • the terminal device may receive the target coordinate information sent by the server based on the TCP connection, generate the rendered image according to the target coordinate information and the virtual object, and then send the rendered image to the wearable device, so that the virtual object is displayed at a position corresponding to the target coordinate information through the wearable device.
  • the obtained instance and semantic information may be used for the AR application (for example, the interaction between the virtual object and the physical environment) .
  • the embodiments of the disclosure may use the semantic information and the instance information to automatically identify the coordinate information of a table and a wall, and then move the virtual object to the corresponding position through a simple command, so as to display the virtual object in the physical environment.
  • FIG. 7 shows an application diagram of an AR application according to an embodiment of the disclosure.
  • As shown in FIG. 7, graph (c) shows a result of interaction with this model.
  • The virtual objects, such as the oil painting and the display, are simulated.
  • some software applications provide some virtual objects that may be displayed in a real environment during man-machine interaction.
  • the terminal device may send the rendered image to the AR glasses.
  • the virtual object can be displayed on the physical object of interest through a monitor or the AR glasses connected to the server.
  • the target coordinate information of the physical object of interest is sent to the terminal device.
  • The rendered image of the virtual object displayed at the target coordinates is generated on the terminal device, and the generated rendered image is sent to the AR glasses, so that the user can visualize the virtual object in the physical environment on the AR glasses.
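  • For illustration, placing the virtual object at the received target coordinates before rendering might look like the sketch below, where the object's model matrix is a translation to those coordinates combined with the camera pose from SLAM; the matrix conventions are assumptions of this sketch.

    import numpy as np

    def model_matrix(target_xyz):
        """Place the virtual object at the target coordinates (world frame)
        received from the server."""
        M = np.eye(4)
        M[:3, 3] = target_xyz
        return M

    def model_view(camera_pose_world, target_xyz):
        """Combine the camera pose (camera-to-world transform from SLAM) with
        the object placement to obtain the model-view matrix used to render
        the virtual object on the AR glasses."""
        view = np.linalg.inv(camera_pose_world)    # world-to-camera transform
        return view @ model_matrix(target_xyz)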
  • the embodiments of the disclosure provide a method for semantic map building, which may include that: at the terminal device side, the collected data of the wearable device is obtained, the collected data including the first image data and the second image data; the second image data is processed to generate the pose data; the first image data and the pose data are sent to the server; at the server side, the 3D grid model is generated according to the first image data and the pose data; and the semantic segmentation is performed on the 3D grid model to obtain the target semantic map which is used for displaying the virtual object in the physical environment.
  • the pose data is also used in building of the target semantic map, and thus a more complete scene reconstruction can be achieved, the accuracy of the whole semantic map is improved, and a virtual object can be displayed on a specific physical object better. Additionally, since the scene reconstruction and the semantic segmentation are performed in the server, it can reduce memory usage and power consumption of the terminal device.
  • FIG. 8 shows a detailed flowchart of a method for semantic map building according to an embodiment of the disclosure. As shown in FIG. 8, taking that the wearable device is the AR glasses as an example, the detailed process may include the following steps.
  • the AR glasses acquire the RGBD image data, the fisheye image data and the IMU data.
  • the terminal device obtains the RGBD image data, the fisheye image data and the IMU data, and uses the SLAM system to generate the pose data according to the fisheye image data and the IMU data.
  • the server obtains the RGBD image data and the pose data, and generates the 3D grid model.
  • The server performs the semantic segmentation on the 3D grid model to obtain the target semantic map.
  • the server displays an interaction result on a monitor according to the interaction between the target semantic map and the virtual object.
  • the terminal device obtains the coordinate information of the object of interest and generates the rendered image of the virtual object in the coordinate information.
  • the AR glasses display the rendered image.
  • The execution body of S801 and S807 is the AR glasses; the execution body of S802 and S806 is the terminal device; and the execution body of S803, S804 and S805 is the server.
  • the AR glasses and the terminal device are in a wired connection through the physical cable, and the RGBD image data, the fisheye image data and the IMU data acquired by the AR glasses may be sent to the terminal device.
  • the RGBD image data here may include the RGB image data and the depth image data.
  • the rendered image generated by the terminal device may also be sent to the AR glasses.
  • the terminal device and the server are in a wireless connection through the TCP, and the RGBD image data and the pose data may be sent to the server.
  • the coordinate information of the object in which the user is interested, which is determined by the server may also be sent to the terminal device.
  • the terminal device obtains the RGB image data, the depth image data, the fisheye image data, the IMU data and their time stamps from the AR glasses through different threads.
  • the pose data is generated through the SLAM system. This system uses the fisheye image data and the IMU data to obtain the reliable 6-DoF pose data. Then, the RGB image data, the depth image data and the pose data which are synchronous are sent to the server through the TCP connection.
  • the server rebuilds the 3D grid model according to the obtained pose data and RGBD image data. After rebuilding the 3D grid model, the server uses information like RGB and grid to run effective semantic segmentation.
  • the semantic information may be used for man-machine interaction in different applications (for example, displaying the virtual object on a table or a wall) .
  • The technical solution integrates data acquisition of the AR glasses, simultaneous 3D reconstruction, simultaneous semantic segmentation and instance segmentation into a whole system. After the data is acquired through the network, the data of the AR glasses may be reconstructed online to achieve simultaneous semantic segmentation and instance segmentation.
  • the system architecture may use the terminal device with the AR glasses to provide a large-scale simultaneous semantic model.
  • the TCP communication sends synchronous data to the server efficiently.
  • a semantic 3D model is built in real time based on the RGBD data on a visual inertial SLAM system and the server.
  • scene reconstruction is not limited to a small scale.
  • the semantic segmentation may be realized in real time, and the 3D reconstruction may be scaled up; besides, the reconstructed semantic 3D model may be applied to virtual object display, semantic building model reconstruction, and other fields.
  • the embodiments of the disclosure may also perform reconstruction directly on the terminal device, and then send a reconstruction result to the server for semantic segmentation.
  • the semantic segmentation may run on the server offline. In this way, the user needs to wait for a semantic segmentation result of a scene, and the scene cannot grow dynamically.
  • The whole semantic 3D model may also be sent back to the terminal device for different types of applications; however, the transmission speed of the 3D models is low.
  • Ideally, all content would be processed on the terminal device rather than on the server; however, the terminal device does not have sufficient computing resources, which results in time delays and exhaustion of memory and battery.
  • the embodiments of the disclosure provide a method for semantic map building.
  • The specific implementation of the foregoing embodiments has been described in detail above. It can be seen that, according to the technical solution of the above embodiments, the pose data is also used in the building of the target semantic map, so that a more complete scene reconstruction can be achieved, the accuracy of the whole semantic map is improved, and the virtual object can be displayed on the specific physical object better. Additionally, since the scene reconstruction and the semantic segmentation are performed in the server, memory usage and power consumption of the terminal device can be reduced.
  • FIG. 9 shows a composition structure diagram of a server 90 according to an embodiment of the disclosure.
  • the server 90 may include: a first receiving unit 901, a modeling unit 902 and a segmenting unit 903.
  • the first receiving unit 901 is configured to receive the first image data and the pose data sent by the terminal device.
  • the modeling unit 902 is configured to generate the 3D grid model according to the first image data and the pose data.
  • the segmenting unit 903 is configured to perform the semantic segmentation on the 3D grid model to obtain the target semantic map, the target semantic map being used for displaying the virtual object in the physical environment.
  • the first receiving unit 901 is specifically configured to establish a TCP connection with the terminal device, and receive the first image data and the pose data sent by the terminal device based on the TCP connection.
  • The server 90 may also include a fusing unit 904, configured to determine the local point clouds corresponding to different observation points of the depth camera based on the first image data and the pose data, and perform the fusion calculation on the local point clouds corresponding to different observation points to determine the fusion value of at least one voxel in the 3D space.
  • the modeling unit 902 is specifically configured to build the 3D grid model based on the fusion value of at least one voxel.
  • the server 90 may further include a voxelizing unit 905, configured to build a geometry, and voxelize the geometry to obtain the 3D space including at least one voxel.
  • the fusing unit 904 is further configured to perform the fusion calculation on the local point clouds corresponding to different observation points by using the TSDF algorithm to obtain the TSDF value of at least one voxel.
  • the modeling unit 902 is further configured to build the 3D grid model according to the TSDF value of at least one voxel.
  • the fusing unit 904 is specifically configured to: determine the first TSDF value and the first weight value obtained at a current observation point, and obtain the fusion TSDF value and the fusion weight value obtained after fusion with a previous observation point; perform the weighted average calculation by using the TSDF algorithm according to the fusion TSDF value, the fusion weight value, the first TSDF value and the first weight value to obtain the second TSDF value and the second weight value corresponding to the first voxel; and update the fusion TSDF value and the fusion weight value according to the second TSDF value and the second weight value, so as to fuse the local point cloud corresponding to the current observation point into the TSDF model.
  • the first voxel is any one of at least one voxel in the 3D space.
  • the modeling unit 902 is specifically configured to, if the TSDF value is a positive value, determine that the voxel is between the depth camera and a surface of the object, and if the TSDF value is a negative value, determine that the voxel is outside a line between the depth camera and the surface of the object.
  • the first image data includes the depth image data and the RGB image data.
  • the depth image data, the RGB image data and the pose data are synchronous in time.
  • the segmenting unit 903 is further configured to input the 3D grid model to the neural network structure, and in the neural network structure, perform the semantic segmentation and the instance segmentation by way of point-wise feature learning to obtain the target semantic map.
  • The segmenting unit 903 is specifically configured to: perform the 3D sparse convolution operation on the 3D grid model to determine the semantic information, the feature embedding information, the spatial embedding information and the occupancy information; perform the supervoxel grouping on the 3D grid model by using the image segmentation algorithm, so as to obtain the supervoxel information; perform the covariance estimation according to the feature embedding information and the spatial embedding information to obtain the target embedding information; perform the clustering operation according to the target embedding information, the occupancy information and the supervoxel information to determine the instance information; and obtain the target semantic map according to the semantic information and the instance information.
  • the server 90 may further include a first sending unit 906, configured to obtain the target coordinate information of the object in which the user is interested from the target semantic map, and send the target coordinate information to the terminal device, the target coordinate information being used for the terminal device to generate and display the rendered image including the virtual object.
  • the "unit” may be a part of circuit, a part of processor, a part of program or software, etc., of course, it may be a module, and it may also be non-modular. Moreover, all parts in the present embodiment may be integrated in a processing unit; or the units exist separately and physically; or two or more than two units are integrated in a unit. The integrated unit may be realized in form of hardware or in form of software function module.
  • If the integrated units are implemented as software function modules and are not sold or used as independent products, they may be stored in a computer-readable storage medium.
  • The technical solution of the present embodiment substantially, or the part making a contribution to the traditional art, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes a number of instructions to make a computer device (which may be a personal computer, a server, a network device, etc.) or a processor perform all or part of the steps of the method in the embodiments.
  • the foregoing storage medium includes any medium that can store program code, such as a U disk, a removable hard disk, a Read Only Memory (ROM) , a Random Access Memory (RAM) , a magnetic disk, or an optical disc.
  • the embodiments of the disclosure provide a computer storage medium, which is applied to the server 90.
  • the computer storage medium stores a computer program.
  • the computer program when executed by a first processor, implements the method described in any above embodiment.
  • FIG. 10 shows a structure diagram of specific hardware of a server 90 according to an embodiment of the disclosure.
  • the server 90 may include: a first communication interface 1001, a first memory 1002 and a first processor 1003.
  • the components are coupled together via a first bus system 1004.
  • the first bus system 1004 is configured to implement connection communication among these components.
  • the first bus system 1004 includes a data bus and further includes a power bus, a control bus and a state signal bus. However, for clear description, various buses in FIG. 10 are marked as the first bus system 1004.
  • the first communication interface 1001 is configured to receive and send a signal in the process of receiving and sending messages with other external network elements.
  • the first memory 1002 is configured to store a computer program capable of running on the first processor 1003.
  • the first processor 1003 is configured to: receive the first image data and the pose data sent by the terminal device; and generate the 3D grid model according to the first image data and the pose data; perform the semantic segmentation on the 3D grid model to obtain the target semantic map, the target semantic map being used for displaying the virtual object in the physical environment.
  • the first memory 1002 in the embodiment of the application may be a volatile memory or a nonvolatile memory, or may include both the volatile and nonvolatile memories.
  • the nonvolatile memory may be a ROM, a PROM, an Erasable PROM (EPROM) , an EEPROM or a flash memory.
  • the volatile memory may be a RAM, and is used as an external high-speed cache.
  • RAMs in various forms may be adopted, such as a Static RAM (SRAM) , a Dynamic RAM (DRAM) , a Synchronous DRAM (SDRAM) , a Double Data Rate SDRAM (DDRSDRAM) , an Enhanced SDRAM (ESDRAM) , a Synchlink DRAM (SLDRAM) and a Direct Rambus RAM (DR RAM) .
  • the first memory 1002 of the system and method described in the application is intended to include, but not limited to, memories of these and any other proper types.
  • the first processor 1003 may be an integrated circuit chip with a signal processing capability. In an implementation process, the steps of the method may be accomplished by an integrated logic circuit of hardware in the first processor 1003 or an instruction in a software form.
  • the first processor 1003 may be a universal processor, a Digital Signal Processor (DSP) , an Application Specific Integrated Circuit (ASIC) , a Field Programmable Gate Array (FPGA) or another programmable logical device, discrete gate or transistor logical device and discrete hardware component. Each method, step and logical block diagram disclosed in the embodiments of the disclosure may be implemented or executed.
  • the universal processor may be a microprocessor, or the processor may also be any conventional processor and the like.
  • the steps of the method disclosed in combination with the embodiments of the disclosure may be directly embodied to be executed and completed by a hardware decoding processor or executed and completed by a combination of hardware and software modules in the decoding processor.
  • the software module may be located in a mature storage medium in this field, such as a RAM, a flash memory, a ROM, a PROM, an EEPROM or a register.
  • the storage medium is located in the first memory 1002.
  • the first processor 1003 reads information from the first memory 1002 and completes the steps of the above method in combination with the hardware of the processor.
  • the processing unit may be realized in one or more of an ASIC, a DSP, a DSP Device (DSPD) , a Programmable Logic Device (PLD) , an FPGA, a universal processor, a controller, a micro-controller, a microprocessor, other electronic units for implementing the functions of the application or a combination thereof.
  • the technology described in the specification can be implemented through modules (such as procedures and functions) that perform the functions described in the application.
  • a software code can be stored in the memory and executed by the processor.
  • the memory can be implemented in or outside the processor.
  • the first processor 1003 is further configured to perform, when running the computer program, the method described in any above embodiment.
  • the embodiments of the disclosure provide a server, which may include: a first receiving unit, a modeling unit and a segmenting unit.
  • the pose data is also used in building of the target semantic map, thus a more complete scene reconstruction can be achieved, the accuracy of the whole semantic map is improved, and a virtual object can be displayed on a specific physical object better. Additionally, since the scene reconstruction and the semantic segmentation are performed in the server, it can reduce memory usage and power consumption of the terminal device.
  • FIG. 11 shows a composition structure diagram of a terminal device 110 according to an embodiment of the disclosure.
  • the terminal device 110 may include: an obtaining unit 1101, a data processing unit 1102 and a second sending unit 1103.
  • the obtaining unit 1101 is configured to obtain the collected data of the wearable device, the collected data including the first image data and the second image data.
  • the data processing unit 1102 is configured to process the second image data to generate the pose data.
  • the second sending unit 1103 is configured to send the first image data and the pose data to the server, the first image data and the pose data being used for the server to build the target semantic map.
  • the first image data includes the depth image data and the RGB image data
  • the second image data includes the fisheye image data and the inertial sensor data
  • the obtaining unit 1101 is specifically configured to obtain the depth image data and time stamp information corresponding to the depth image data through the first thread, obtain the RGB image data and time stamp information corresponding to the RGB image data through the second thread, obtain the fisheye image data and time stamp information corresponding to the fisheye image data through the third thread, and obtain the inertial sensor data and time stamp information corresponding to the inertial sensor data through the fourth thread.
  • the time stamp information is used for measuring whether the depth image data, the RGB image data, the fisheye image data and the inertial sensor data are synchronous in time.
  • the data processing unit 1102 is specifically configured to perform the pose calculation on the fisheye image data and the inertial sensor data by using the SLAM system, so as to generate the pose data.
  • the terminal device 110 may also include a synchronizing unit 1104, configured to: determine first time stamp information corresponding to the pose data, select the depth image data in synchronization with the first time stamp information from the depth image data and time stamp information corresponding to the depth image data, select the RGB image data in synchronization with the first time stamp information from the RGB image data and time stamp information corresponding to the RGB image data, and determine the first image data according to the selected depth image data and the selected RGB image data.
  • the first image data and the pose data are synchronous in time.
  • the second sending unit 1103 is specifically configured to establish the TCP connection with the server, and send the first image data and the pose data to the server based on the TCP connection.
  • the terminal device 110 may further include a second receiving unit 1105, configured to receive the target coordinate information sent by the server based on the TCP connection.
  • the data processing unit 1102 is further configured to generate the rendered image according to the target coordinate information and the virtual object, and send the rendered image to the wearable device, so that the virtual object is displayed at a position corresponding to the target coordinate information through the wearable device.
  • the "unit" may be a part of a circuit, a part of a processor, a part of a program or software, etc.; of course, it may also be a module, or it may be non-modular. Moreover, all parts in the present embodiment may be integrated in one processing unit, or the units may exist separately and physically, or two or more units may be integrated in one unit. The integrated unit may be realized in form of hardware or in form of a software function module.
  • if the integrated unit is implemented by software function modules, and the software function modules are sold or used as independent products, they may also be stored in a computer readable storage medium.
  • the embodiments of the disclosure provide a computer storage medium, which is applied to the terminal device 110.
  • the computer storage medium stores a computer program.
  • the computer program when executed by a second processor, implements the method described in any above embodiment.
  • FIG. 12 shows a structure diagram of specific hardware of a terminal device 110 according to an embodiment of the disclosure.
  • the terminal device 110 may include: a second communication interface 1201, a second memory 1202 and a second processor 1203.
  • the components are coupled together via a second bus system 1204.
  • the second bus system 1204 is configured to implement connection communication among these components.
  • the second bus system 1204 includes a data bus and further includes a power bus, a control bus and a state signal bus. However, for clear description, various buses in FIG. 12 are marked as the second bus system 1204.
  • the second communication interface 1201 is configured to receive and send a signal in the process of receiving and sending messages with other external network elements.
  • the second memory 1202 is configured to store a computer program capable of running on the second processor 1203.
  • the second processor 1203 is configured to: obtain the collected data of the wearable device, the collected data including the first image data and the second image data; process the second image data to generate the pose data; and send the first image data and the pose data to the server, the first image data and the pose data being used for the server to build the target semantic map.
  • the second processor 1203 is further configured to perform, when running the computer program, the method described in any above embodiment.
  • the embodiments of the disclosure provide a terminal device, which may include an obtaining unit, a data processing unit and a second sending unit.
  • the pose data is also used in building of the target semantic map, thus a more complete scene reconstruction can be achieved, the accuracy of the whole semantic map is improved, and a virtual object can be displayed on a specific physical object better. Additionally, since the scene reconstruction and the semantic segmentation are performed in the server, it can reduce memory usage and power consumption of the terminal device.
  • the sequence numbers of the embodiments of the disclosure are only for description and do not represent the superiority or inferiority of the embodiments.
  • at a terminal device side, collected data of a wearable device is obtained, the collected data including first image data and second image data; the second image data is processed to generate pose data; and the first image data and the pose data are sent to a server, so that the server builds a target semantic map.
  • at a server side, the first image data and the pose data sent by the terminal device are received; a 3D grid model is generated according to the first image data and the pose data; and semantic segmentation is performed on the 3D grid model to obtain a target semantic map, the target semantic map being used for display of a virtual object in a physical environment.
  • the pose data is also used in building of the target semantic map, thus a more complete scene reconstruction can be achieved, the accuracy of the whole semantic map is improved, and a virtual object can be displayed on a specific physical object better. Additionally, since the scene reconstruction and the semantic segmentation are performed in the server, it can reduce memory usage and power consumption of the terminal device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A method for semantic map building, a server, a terminal device and a storage medium are disclosed. The method, which is applied to a server, includes that: first image data and pose data sent by a terminal device are received; a Three Dimension (3D) grid model is generated according to the first image data and the pose data; and semantic segmentation is performed on the 3D grid model to obtain a target semantic map, the target semantic map being used for displaying a virtual object in a physical environment.

Description

METHOD FOR SEMANTIC MAP BUILDING, SERVER, TERMINAL DEVICE AND STORAGE MEDIUM
TECHNICAL FIELD
Embodiments of the disclosure relate to the field of computer vision technology, and more particularly to a method for semantic map building, a server, a terminal device and a storage medium.
BACKGROUND
In recent years, with the development of Augmented Reality (AR) devices, AR technology is everywhere. The AR technology is a new technology that seamlessly combines information of the real world with information of the virtual world. Through the AR technology, virtual information can be applied to the real world and then perceived by human senses, so that people experience an "immersive" sense of reality.
In related technologies, most AR devices have the Simultaneous Localization And Mapping (SLAM) technology, which allows a user to generate maps or grids and localizes the user. However, although there are already solutions for generating Three Dimension (3D) semantic information, the existing technical solutions lack comprehensive consideration; as a result, the accuracy of the whole rebuilt semantic map is low, and a virtual object cannot be displayed well on a specific physical object.
SUMMARY
The disclosure provides a method for semantic map building, a server, a terminal device and a storage medium.
The technical solution of the application may be implemented as follows.
In a first aspect, the embodiments of the disclosure provide a method for semantic map building, which is applied to a server, and may include the following operations. First image data and pose data sent by a terminal device are received. A 3D grid model is generated according to the first image data and the pose data. Semantic segmentation is performed on the 3D grid model to obtain a target semantic map, the target semantic map being used for displaying a virtual object in a physical environment.
In a second aspect, the embodiments of the disclosure provide a method for semantic map building, which is applied to a terminal device, and may include the following operations. Collected data of a wearable device is obtained, the collected data including first image data and second image data. The second image data is processed to generate pose data. The first image data and the pose data are sent to a server, the first image data and the pose data being used for the server to build a target semantic map.
In a third aspect, the embodiments of the disclosure provide a server, which may  include: a first receiving unit, a modeling unit and a segmenting unit. The first receiving unit is configured to receive the first image data and the pose data sent by the terminal device. The modeling unit is configured to generate the 3D grid model according to the first image data and the pose data. The segmenting unit is configured to perform semantic segmentation on the 3D grid model to obtain the target semantic map, the target semantic map being used for displaying the virtual object in the physical environment.
In a fourth aspect, the embodiments of the disclosure provide a server, which may include: a first memory and a first processor. The first memory is configured to store a computer program capable of running on the first processor. The first processor is configured to execute, when running the computer program, the method in the first aspect.
In a fifth aspect, the embodiments of the disclosure provide a terminal device, which may include: an obtaining unit, a data processing unit and a second sending unit. The obtaining unit is configured to obtain the collected data of the wearable device, the collected data including the first image data and the second image data. The data processing unit is configured to process the second image data to generate the pose data. The second sending unit is configured to send the first image data and the pose data to the server, the first image data and the pose data being used for the server to build the target semantic map.
In a sixth aspect, the embodiments of the disclosure provide a terminal device, which may include: a second memory and a second processor. The second memory is configured to store a computer program capable of running on the second processor. The second processor is configured to execute, when running the computer program, the method in the second aspect.
In a seventh aspect, the embodiments of the disclosure provide a computer storage medium, in which a computer program is stored. The computer program implements the method in the first aspect when executed by the first processor, or implements the method in the second aspect when executed by the second processor.
The embodiments of the disclosure provide a method for semantic map building, a server, a terminal device and a storage medium. At the terminal device side, collected data of a wearable device is obtained, the collected data including first image data and second image data; the second image data is processed to generate pose data; the first image data and the pose data are sent to a server, so that the server builds a target semantic map. At the server side, the first image data and the pose data sent by the terminal device are received; a 3D grid model is generated according to the first image data and the pose data; and semantic segmentation is performed on the 3D grid model to obtain a target semantic map, the target semantic map being used for displaying a virtual object in a physical environment. In this way, the pose data is also used in building of the target semantic map, thus a more complete scene reconstruction can be achieved, the accuracy of the whole semantic map is improved, and a virtual object can be  displayed on a specific physical object better. Additionally, since the scene reconstruction and the semantic segmentation are performed in the server, it can reduce memory usage and power consumption of the terminal device.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic diagram of the composition of a vision enhancement system according to an embodiment of the disclosure.
FIG. 2 is a flowchart of a method for semantic map building according to an embodiment of the disclosure.
FIG. 3 is a structural schematic diagram of AR glasses according to an embodiment of the disclosure.
FIG. 4 is a schematic diagram of a marching cube according to an embodiment of the disclosure.
FIG. 5 is an architecture diagram of semantic segmentation and instance segmentation according to an embodiment of the disclosure.
FIG. 6 is a flowchart of another method for semantic map building according to an embodiment of the disclosure.
FIG. 7 is an application diagram of an AR application according to an embodiment of the disclosure.
FIG. 8 is a detailed flowchart of a method for semantic map building according to an embodiment of the disclosure.
FIG. 9 is a composition structure diagram of a server according to an embodiment of the disclosure.
FIG. 10 is a structure diagram of specific hardware of a server according to an embodiment of the disclosure.
FIG. 11 is a composition structure diagram of a terminal device according to an embodiment of the disclosure.
FIG. 12 is a structure diagram of specific hardware of a terminal device according to an embodiment of the disclosure.
DETAILED DESCRIPTION
In order to understand characteristics and technical contents in the embodiments of the disclosure in more detail, the implementation of the embodiments of the disclosure is elaborated in combination with the accompanying drawings. The accompanying drawings are only used for reference, but not intended to limit the embodiments of the disclosure.
Unless otherwise defined, all technical and scientific terms used in the specification have the same meanings as commonly understood by those skilled in the art to which the application belongs. The terms used in the specification are only for the purpose of describing the embodiments of the disclosure and are not intended to limit the application.
"Some embodiments" involved in the following descriptions describes a subset of all possible embodiments. However, it can be understood that "some embodiments" may be the same subset or different subsets of all the possible embodiments, and may be combined without conflicts. It should also be pointed out that term "first/second/third" involved in the embodiments of the disclosure is only for distinguishing similar objects and does not represent a specific sequence of the objects. It can be understood that, "first/second/third" may be interchanged to specific sequences or orders if allowed to implement the embodiments of the disclosure described herein in sequences except the illustrated or described herein.
In recent years, with the development of AR devices (such as Magic Leap One and HoloLens) and development kits (such as ARCore and ARKit), AR technology has become ubiquitous. Most AR devices and development kits have the SLAM technology, which allows a user to generate maps or grids and localizes the user. However, there is currently no specialized system architecture that solves the problems of simultaneous localization/scene reconstruction/3D semantic segmentation directly using the AR devices and servers.
In related technologies, there are already solutions for generating 3D semantic information. For example, an Intel RGB-D depth camera (Intel RealSense, RealSense for short) is used to obtain 3D simultaneous semantic information, and the information is displayed on the Magic Leap One. This solution requires the RealSense to be connected to a server through a physical cable; as a result, data collection is limited by the cable length. Then, each frame of color image including red, green, blue band information and depth image (RGBD) from the RealSense is sent to the server. Here, a semantic segmentation technology is applied to the color image including red, green, blue band information (RGB image), and the corresponding depth (point cloud) is labeled with a semantic class; finally, the semantic points are sent back to the Magic Leap One to display the semantic information. However, this solution does not combine 6-Degrees of Freedom (DoF) pose data to rebuild the whole semantic map.
It is to be understood that AR is a technology that combines the virtual world with the real world and allows people to interact with the virtual world in real time. The key technology of AR is SLAM and reconstruction of the real world. 6-DoF pose and depth images generated based on the SLAM may achieve a more complete scene reconstruction. In order to more intelligently perceive and interact with a virtual object in the real world, semantic information is needed for scene understanding in a 3D reconstructed world, and RGBD data from the AR device may be used for 3D world perception.
In the embodiments of the disclosure, compared with the server, the AR device (for example, a terminal device with the AR glasses) has limited computing power and memory. In order to achieve large-scale environment reconstruction and simultaneous semantic segmentation, the AR device may not be able to process all data in real time. Therefore, it is necessary to split the simultaneous localization/scene reconstruction/3D semantic segmentation of the system between the AR device and the server. A communication connection between the AR device and the server is established through a Transmission Control Protocol (TCP). A user may access real-time 6-DoF pose data and RGBD data from the AR device, and send these data to the server for scene reconstruction and 3D semantic segmentation. In this way, through the embodiments of the disclosure, the user may interact with a physical environment in a more intelligent way after the 3D semantic segmentation, for example, a virtual object is displayed on a specific physical object.
Each embodiment of the application is described in detail below in combination with the accompanying drawings.
FIG. 1 shows a schematic diagram of the composition of a vision enhancement system according to an embodiment of the disclosure. As shown in FIG. 1, a vision enhancement system 10 may include a wearable device 110, a terminal device 120 and a server 130. A communication connection between the wearable device 110 and the terminal device 120 is established through a physical cable, and a communication connection between the terminal device 120 and the server 130 is established through the TCP.
It is to be noted that the wearable device 110 may specifically refer to a monocular or binocular Head-Mounted Display (HMD) , for example, the AR glasses. In FIG. 1, the wearable device 110 may include one or more display modules placed near the user's one or both eyes. Through the display module of the wearable device 110, display content therein may be presented in front of the user's eyes, and the display content can fill or partially fill the user's field of vision. It is also to be noted that the display module may refer to one or more Organic Light-Emitting Diode (OLED) modules, Liquid Crystal Display (LCD) modules, laser display modules, etc.
In some embodiments, the wearable device 110 may further include one or more sensors and one or more cameras. For example, the wearable device 110 may further include one or more sensors, such as an Inertial Measurement Unit (IMU) , an accelerometer, a gyroscope, a proximity sensor, and a depth camera, so as to obtain collected data.
It is also to be noted that the terminal device 120 may be implemented in a variety of forms. For example, the terminal devices described in the embodiments of the disclosure may include a smartphone, a tablet personal computer, a notebook computer, a laptop computer, a palm computer, a Personal Digital Assistant (PDA) , a smartwatch, etc. In addition, the server 130 may be a cloud server, a network server, etc. The server 130 may be regarded as a computer for managing computing resources and can provide computing or application services for the terminal device 120. Moreover, the server 130 has high-speed CPU computing power, long  duration of reliable operation, strong data handling capacity and better scalability.
Based on the application scenario shown in FIG. 1, FIG. 2 shows a flowchart of a method for semantic map building according to an embodiment of the disclosure. As shown in FIG. 2, the method may include the following steps.
At S201, a terminal device obtains collected data of a wearable device, the collected data including first image data and second image data.
It is to be noted that the embodiments of the disclosure are applied to the vision enhancement system including the wearable device, the terminal device and the server. In the system, the wearable device and the terminal device are in a wired connection through the physical cable to realize information interaction; and a wireless connection is established between the terminal device and the server through the TCP to realize information interaction.
It is to be noted that for the terminal device, it first needs to obtain the collected data of the wearable device. The collected data may include the first image data and the second image data. The first image data is used for scene reconstruction and semantic segmentation, and the second image data is used for estimation of pose data.
At S202, the terminal device processes the second image data to generate pose data.
It is to be noted that the terminal device is equipped with an SLAM system. The SLAM system may use the second image data to estimate a camera pose. In the embodiments of the disclosure, the second image data may include fisheye image data and inertial sensor data. Therefore, in some embodiments, the operation of processing the second image data to generate the pose data may include that, pose calculation is performed on the fisheye image data and the inertial sensor data by using the SLAM system, so as to generate the pose data.
It is to be noted that the SLAM system may use the fisheye image data and the inertial sensor data to generate reliable pose data on the terminal device. Specifically, the 6-DoF pose data may be generated by using the fisheye image data and the inertial sensor data. The six degrees of freedom may include degrees of freedom of movement along rectangular coordinate axes x, y, z and degrees of freedom of rotation around the three axes, so that location information can be completely determined, and then it can be better used for scene reconstruction and semantic segmentation.
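As an illustrative aid only (not part of the disclosure), a 6-DoF pose sample produced by the SLAM system might be represented as a translation plus a rotation; the field names, the quaternion convention and the timestamp unit below are assumptions made for this sketch.

```python
from dataclasses import dataclass

@dataclass
class Pose6DoF:
    """Hypothetical container for one 6-DoF pose sample from the SLAM system."""
    timestamp: float  # time stamp later used for synchronization (seconds)
    tx: float         # translation along the x axis
    ty: float         # translation along the y axis
    tz: float         # translation along the z axis
    qw: float         # rotation about the three axes, stored as a unit quaternion
    qx: float
    qy: float
    qz: float

    def as_tuple(self):
        # Flattened form, e.g. for serialization before sending over TCP.
        return (self.timestamp, self.tx, self.ty, self.tz,
                self.qw, self.qx, self.qy, self.qz)
```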
In the embodiments of the disclosure, the first image data may refer to RGBD image data, and may specifically include depth image data and RGB image data for scene reconstruction and semantic segmentation; and the second image data may include the fisheye image data and the inertial sensor data for estimation of the pose data. In addition, an inertial sensor may also be called IMU, that is, the inertial sensor data may also be called IMU data.
Specifically, in some embodiments, for S201, the operation of obtaining the  collected data of the wearable device may include the following actions.
Depth image data and time stamp information corresponding to the depth image data are obtained through a first thread.
RGB image data and time stamp information corresponding to the RGB image data are obtained through a second thread.
Fisheye image data and time stamp information corresponding to the fisheye image data are obtained through a third thread.
Inertial sensor data and time stamp information corresponding to the inertial sensor data are obtained through a fourth thread.
The time stamp information is used for measuring whether the depth image data, the RGB image data, the fisheye image data and the inertial sensor data are synchronous in time.
It is to be noted that for the wearable device, for example, the AR glasses shown in FIG. 3, data may be acquired through the AR glasses, thus the depth image data, the RGB image data, the fisheye image data and the IMU data can be acquired. Here, the depth image data is obtained by a depth camera, the RGB image data is obtained by an RGB camera, the fisheye image data is obtained by a wide-angle fisheye camera, and the IMU data is obtained by the IMU.
It is also to be noted that the depth image data and the RGB image data may be used for scene reconstruction and semantic segmentation, the fisheye image data and the IMU data may be used for estimation of camera pose, and then all the data may be sent to the terminal device through the physical cable for further processing.
It is also to be noted that the SLAM system uses the fisheye image data and the inertial sensor data to generate reliable pose data on the terminal device. Here, since a fisheye image has a larger field of vision than that of an RGB image, more reliable pose data can be obtained in this way. In addition, compared with using visual SLAM only, the estimation of camera pose using visual inertial SLAM in the embodiments of the disclosure has better robustness and accuracy.
In this way, the terminal device may obtain the depth image data, the RGB image data, the fisheye image data, the inertial sensor data and their corresponding time stamp information from the wearable device through different threads. The time stamp information here is used for obtaining synchronous data. Since the time of sending each datum in the AR glasses may be out of sync, the time stamp information is needed to select data at similar time points to synchronize data.
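A minimal sketch of how the four acquisition threads described above might be organized on the terminal device is given below; the frame-grabbing callback, the queue layout and the polling rates are placeholders, since the disclosure does not specify the device API.

```python
import queue
import threading
import time

# One queue per data stream; each entry is a (timestamp, payload) tuple.
streams = {name: queue.Queue() for name in ("depth", "rgb", "fisheye", "imu")}

def capture_loop(name, grab_frame, period_s):
    """Generic acquisition thread: poll a source and enqueue time-stamped data."""
    while True:
        payload = grab_frame()                     # placeholder for the real device call
        streams[name].put((time.time(), payload))  # attach time stamp information
        time.sleep(period_s)

def dummy_source():
    return b"frame-bytes"                          # stand-in payload for the sketch

# Four threads, mirroring the first/second/third/fourth threads in the text.
for name, period in (("depth", 1 / 30), ("rgb", 1 / 30),
                     ("fisheye", 1 / 30), ("imu", 1 / 200)):
    threading.Thread(target=capture_loop, args=(name, dummy_source, period),
                     daemon=True).start()
```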
Further, in some embodiments, the method may further include that: time synchronization is performed on the depth image data, the RGB image data and the pose data through a time synchronization program.
In a specific embodiment, time synchronization may include the following  operations.
First time stamp information corresponding to the pose data is determined.
Depth image data in synchronization with the first time stamp information is selected from the depth image data and time stamp information corresponding to the depth image data.
RGB image data in synchronization with the first time stamp information is selected from the RGB image data and time stamp information corresponding to the RGB image data.
First image data is determined according to the selected depth image data and the selected RGB image data. The first image data and the pose data are synchronous in time.
It is to be noted that in the embodiments of the disclosure, the pose data and the first time stamp information may be obtained according to the fisheye image data, the inertial sensor data, and their corresponding time stamp information; then, the depth image data and the RGB image data which are synchronous in time may be selected according to the first time stamp information; at this time, the obtained first image data and pose data are synchronous in time.
It is also to be noted that because the acquired data/data sets come from different hardware and different threads, time stamps of the data/data sets will be different and out of sync. For better scene reconstruction and semantic segmentation, data synchronization is necessary. Therefore, the terminal device may be provided with the time synchronization program, so that the time synchronization program runs on the terminal device. Only synchronous data is allowed to be sent to the server.
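The time synchronization program itself is not specified in detail; one simple realization, sketched below under an assumed tolerance, selects the depth and RGB frames whose time stamps are nearest to the pose time stamp and discards data that cannot be matched.

```python
def select_synchronous(pose_ts, stamped_frames, tolerance_s=0.015):
    """Pick the frame whose time stamp is closest to the pose time stamp.

    stamped_frames: list of (timestamp, frame) tuples for one stream
                    (e.g. depth or RGB). Returns None if nothing is close
                    enough, in which case the data are not sent to the server.
    """
    if not stamped_frames:
        return None
    ts, frame = min(stamped_frames, key=lambda item: abs(item[0] - pose_ts))
    return frame if abs(ts - pose_ts) <= tolerance_s else None

# Usage (hypothetical buffers and sender): only synchronous data are forwarded.
# depth = select_synchronous(pose.timestamp, depth_buffer)
# rgb   = select_synchronous(pose.timestamp, rgb_buffer)
# if depth is not None and rgb is not None:
#     send_to_server(depth, rgb, pose)
```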
At S203, the terminal device sends the first image data and the pose data to the server.
In the embodiments of the disclosure, the first image data and the pose data are sent to the server by the terminal device only in case of time synchronization between them, so that the server builds a target semantic map.
In addition, the terminal device and the server are in a wireless connection, specifically through the TCP, to realize information interaction. Therefore, in some embodiments, the operation of sending the first image data and the pose data to the server may include the following: a TCP connection is established with the server, and the first image data and the pose data are sent to the server based on the TCP connection.
It is to be noted that the TCP connection needs to be established between the terminal device and the server, and then the terminal device may send data to the server or receive data from the server based on the TCP connection. Only when the first image data and the pose data are synchronous data are they allowed to be sent to the server. In this way, data can be sent efficiently, and the bandwidth of TCP communication can be reduced.
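Length-prefixed framing over a TCP socket is one common way to realize the connection described above; the server address, port and serialization format in the sketch below are assumptions, not values from the disclosure.

```python
import pickle
import socket
import struct

def send_synchronous_packet(sock, depth, rgb, pose):
    """Send one synchronized (depth, RGB, pose) packet with a 4-byte length prefix."""
    payload = pickle.dumps({"depth": depth, "rgb": rgb, "pose": pose})
    sock.sendall(struct.pack("!I", len(payload)) + payload)

# Hypothetical endpoint; the disclosure only states that a TCP connection is used.
with socket.create_connection(("192.168.1.10", 9000)) as sock:
    send_synchronous_packet(sock, depth=b"...", rgb=b"...", pose=(0.0,) * 8)
```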
At S204, the server generates a 3D grid model according to the first image data and the pose data.
It is to be noted that the server needs to receive the first image data and the pose data sent by the terminal device, which may specifically include that: the server establishes the TCP connection with the terminal device; and the server receives the first image data and the pose data sent by the terminal device based on the TCP connection.
That is, after the TCP connection is established between the server and the terminal device, the server may receive the first image data and the pose data sent by the terminal device. Here, the first image data may include the depth image data and the RGB image data, and the depth image data, the RGB image data and the pose data are synchronous in time, which can reduce the bandwidth of TCP communication.
In some embodiments, for S204, the operation of generating the 3D grid model according to the first image data and the pose data may include that, local point clouds corresponding to different observation points of the depth camera are determined based on the first image data and the pose data; fusion calculation is performed on the local point clouds corresponding to different observation points to determine a fusion value of at least one voxel in 3D space; and the 3D grid model is built based on the fusion value of at least one voxel.
Further, in some embodiments, for the 3D space including at least one voxel, the method may further include that, a geometry is built, and the geometry is voxelized to obtain the 3D space including at least one voxel.
It is to be noted that the geometry may specifically refer to a cuboid bounding box, which may completely surround an object to be reconstructed. Then, the cuboid bounding box is voxelized, so that the 3D space including at least one voxel can be obtained.
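A minimal NumPy sketch of voxelizing a cuboid bounding box into a grid of voxel centers in world coordinates is shown below; the bounding-box extents and the 5 cm resolution are arbitrary example values, not parameters from the disclosure.

```python
import numpy as np

def voxelize_bounding_box(bbox_min, bbox_max, voxel_size):
    """Divide a cuboid bounding box into voxels and return their center coordinates.

    Returns an array of shape (nx, ny, nz, 3) holding the world-space center of
    every voxel, plus the grid resolution (nx, ny, nz).
    """
    bbox_min = np.asarray(bbox_min, dtype=np.float64)
    bbox_max = np.asarray(bbox_max, dtype=np.float64)
    dims = np.ceil((bbox_max - bbox_min) / voxel_size).astype(int)
    xs, ys, zs = [bbox_min[i] + (np.arange(dims[i]) + 0.5) * voxel_size
                  for i in range(3)]
    centers = np.stack(np.meshgrid(xs, ys, zs, indexing="ij"), axis=-1)
    return centers, tuple(dims)

# Example: a 4 m x 3 m x 2.5 m room at 5 cm voxel resolution.
centers, dims = voxelize_bounding_box((0, 0, 0), (4.0, 3.0, 2.5), 0.05)
```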
It is also to be noted that after the 3D space is obtained, the operation that fusion calculation is performed on the local point clouds corresponding to different observation points to determine the fusion value of at least one voxel in the 3D space may include that, fusion calculation is performed on the local point clouds corresponding to different observation points by using a Truncated Signed Distance Function (TSDF) algorithm based on the 3D space, to obtain a TSDF value of at least one voxel.
Correspondingly, the operation of building the 3D grid model based on the fusion value of at least one voxel may include that: the 3D grid model is built according to the TSDF value of at least one voxel.
That is, in the embodiments of the disclosure, the local point clouds corresponding to different observation points may be fused into a TSDF algorithm model to obtain the TSDF value of at least one voxel in the 3D space; then, the 3D grid model is built  according to the TSDF value of at least one voxel in the 3D space. Specifically, when the cuboid bounding box is voxelized to obtain the 3D space including at least one voxel, and then the local point clouds corresponding to different observation points are fused into the TSDF algorithm model based on the 3D space, the TSDF value of at least one voxel in the 3D space may be calculated using the TSDF algorithm.
Further, in some embodiments, the method may further include that: if the TSDF value is a positive value, it is determined that the voxel is between the depth camera and the surface of the object; and if the TSDF value is a negative value, it is determined that the voxel is outside a line between the depth camera and an object surface.
That is, in the embodiments of the disclosure, after the RGB image data, the depth image data and the pose data are obtained, a 3D network model is obtained by fusing through the TSDF algorithm model. Specifically, the local point clouds may be obtained by observing at different locations through the depth camera. In order to obtain the 3D network model of the environment, the local point clouds observed by the depth camera from different angles need to be fused. By using this fusion method to extract surface information from dense point clouds, information redundancy can be greatly reduced, a more accurate surface model (namely the 3D network model) can be obtained, and the influence of noise can be reduced. The TSDF model is an effective method for obtaining a reconstructed surface. Specifically, the distance from the voxel center (namely the center coordinates of the voxel) to the nearest surface may be expressed by the signed distance function TSDF. For a certain voxel in space, if the TSDF value is a positive value, the voxel center is between the depth camera and the surface of the object; if the TSDF value is a negative value, the voxel center is outside a line between the depth camera and the surface of the object. It is to be noted that the zero of the TSDF represents the surface of the object.
In a specific embodiment, the operation of performing fusion calculation on the local point clouds corresponding to different observation points by using the TSDF algorithm to obtain the TSDF value of at least one voxel may include that, a first TSDF value and a first weight value obtained at the current observation point are determined based on a first voxel, and a fusion TSDF value and a fusion weight value obtained after the fusion with the previous observation point are obtained; weighted average calculation is performed by using the TSDF algorithm according to the fusion TSDF value, the fusion weight value, the first TSDF value and the first weight value, to obtain a second TSDF value and a second weight value corresponding to the first voxel; and the fusion TSDF value and the fusion weight value are updated according to the second TSDF value and the second weight value, so as to fuse the local point cloud corresponding to the current observation point into the TSDF model, the first voxel being any one of at least one voxel in the 3D space.
It is to be noted that in a camera coordinate system, a light beam from the depth camera passes through the voxel at a specific location in space. The distance from the voxel center to the surface of the object may be approximated as the distance from the voxel center to the observation point along a light direction. Then, the distance from the voxel to the observation point of light is the TSDF value of the voxel. If the voxel is between the depth camera and the observation point, the TSDF value is a positive value; otherwise the TSDF value is a negative value. In this way, depth maps observed at different locations are fused. When different lights pass through the same voxel, the TSDF value is updated using a weighted average method as follows.
D i (v) = (W i-1 (v) · D i-1 (v) + α · d (v) ) / (W i-1 (v) + α)  (1)
W i (v) = W i-1 (v) + α  (2)
where α represents a first weight value obtained at the current observation point, which may also be called confidence, d (v) represents a first TSDF value obtained at the current observation point, namely observed quantity, that is, the distance from the voxel center to the surface of the object which is calculated according to the depth observed by the depth camera, D i (v) represents the TSDF value of the i-th point, namely a fusion TSDF value, and W i (v) represents a weight value of the i-th point, namely a fusion weight value.
That is, for each voxel, besides storing a distance function D (v) , it further retains a continuously accumulated weight value to superpose each measured confidence α. α is a constant, usually its value may be 1. In addition, the updated TSDF value here mainly refers to geometric information of the voxel. For color information of the voxel, the fused RGB color may also be obtained by the weighted average method.
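A direct transcription of formulas (1) and (2) into code is sketched below, assuming the fused TSDF values D and weights W are stored as NumPy arrays with one entry per voxel; the truncation distance is an assumed example value, and α = 1 follows the text.

```python
import numpy as np

def update_tsdf(D, W, d_obs, valid, alpha=1.0, truncation=0.05):
    """Fuse one observation into the TSDF volume using formulas (1) and (2).

    D, W   : arrays of fused TSDF values and accumulated weights, one per voxel.
    d_obs  : per-voxel signed distance to the surface observed from the current
             viewpoint (positive in front of the surface, negative behind it).
    valid  : boolean mask of voxels actually observed by the current depth frame.
    """
    d_obs = np.clip(d_obs, -truncation, truncation)                       # truncated SDF
    D_new = np.where(valid, (W * D + alpha * d_obs) / (W + alpha), D)     # formula (1)
    W_new = np.where(valid, W + alpha, W)                                 # formula (2)
    return D_new, W_new
```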
In a mapping thread, when the new RGB image data and depth image data arrive, a fusion process of the TSDF model is continued. After obtaining the TSDF model, we need to extract information of the surface of the object from the TSDF model. The surface of the object may usually be represented by many interconnected triangles. The triangle of the surface of the object is extracted from the TSDF model through a marching cube. Specific steps are as follows.
FIG. 4 shows a schematic diagram of a marching cube according to an embodiment of the disclosure. As shown in FIG. 4, the cube provided in graph (a) may be composed of eight adjacent voxel centers in the space. In the TSDF model, the stored value of each voxel center is the distance from the point to the nearest surface. If the surface of the object passes through the cube, the TSDF values stored in the vertexes of the cube at the two sides of the surface of the object must have different signs. By calculating the product of the TSDF values of the two endpoints on each side of the cube, it is possible to determine whether the signs differ and thus whether a surface passes through that side. If there is a surface passing through, the location at which the surface passes through the side of the cube may be determined through the TSDF values of the vertexes of the cube. After the locations of multiple surface points are obtained, triangles may be connected to form a complete object surface, as shown in (b) in FIG. 4, and the triangles may be extracted from the vertexes of the surface of the object.
Briefly, in the embodiments of the disclosure, the TSDF is a common method for calculating the surface of the object in 3D reconstruction, which may use the TSDF to build space voxels, obtain the TSDF value of each voxel, and then extract the surface of the object using the above-mentioned marching cube method.
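Assuming the TSDF volume is stored as a 3D NumPy array, the zero level set (the surface of the object) can be extracted with an off-the-shelf marching-cubes implementation such as the one in scikit-image; this is an illustration of the technique, not the specific implementation used in the disclosure.

```python
import numpy as np
from skimage import measure

def extract_surface(tsdf_volume, voxel_size, volume_origin):
    """Extract the triangle mesh at the TSDF zero crossing (the object surface)."""
    verts, faces, normals, _ = measure.marching_cubes(tsdf_volume, level=0.0)
    # Convert vertex indices of the voxel grid back to world coordinates.
    verts_world = np.asarray(volume_origin) + verts * voxel_size
    return verts_world, faces, normals
```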
For the TSDF model, a large space (which may be called volume) is taken as the 3D model to be built, and the space may completely include an object model to be reconstructed. The volume is composed of many small voxels (namely small cubes) , and each voxel corresponds to a point in the space, which mainly involves two parameters: a distance value from the voxel to the nearest surface and a weight value when the voxel is updated.
First, the cuboid bounding box which can completely surround the object needing to be reconstructed needs to be built; then, the cuboid bounding box is voxelized, that is, the cuboid bounding box is divided into n equal parts, and the size of a voxel depends on the size of the bounding box and the number of divided voxels; then each voxel is translated into a 3D position point in the world coordinate system.
Second, all the voxels are traversed. Taking a 3D position point v of a voxel in the world coordinate system as an example, the local point clouds observed at different locations are fused according to the above formula (1) and formula (2) until the final output result can accurately reconstruct the 3D grid model. In addition, it is to be noted that a characteristic of the TSDF model is that the calculation is very simple and involves no complicated operations, the details of the generated grid are kept well, and the accuracy is also high.
At S205, the server performs semantic segmentation on the 3D grid model to obtain a target semantic map.
It is to be noted that after rebuilding the 3D grid model, the server may continue to perform semantic segmentation on the 3D grid model, so as to obtain the target semantic map. The target semantic map is used for displaying a virtual object in a physical environment.
In some embodiments, the operation of performing semantic segmentation on the 3D grid model to obtain the target semantic map may include that, the 3D grid model is input to a neural network structure; and in the neural network structure, semantic segmentation and instance segmentation are performed by way of point-wise feature learning to obtain the target semantic map.
In a specific embodiment, the operation of performing semantic segmentation and instance segmentation by way of point-wise feature learning to obtain the target semantic map may include that, a 3D sparse convolution operation is performed on the 3D grid model to determine semantic information, feature embedding information, spatial embedding information and occupancy information; supervoxel grouping is performed on the 3D grid model by using an image segmentation algorithm, so as to obtain supervoxel information; covariance estimation is performed according to the feature embedding information and the spatial embedding information to obtain target embedding information; and clustering operation is performed according to the target embedding information, the occupancy information and the supervoxel information to determine instance information; the target semantic map is obtained according to the semantic information and the instance information.
It is to be noted that a simultaneous semantic segmentation architecture is run on the server to process the rebuilt 3D grid model. Here, the neural network structure may be a 3D UNet structure, or may also be another neural network structure, which is not limited in the embodiments of the disclosure.
It is also to be noted that object detection needs to provide not only the class of an object in the image, but also the location of the object (bounding box). The semantic segmentation needs to predict the class label to which each pixel of an input image belongs. The instance segmentation further needs to distinguish different individuals in the same class based on the semantic segmentation. For example, after the semantic segmentation, it may be determined that three people all belong to the label "people", but each person may be regarded as one instance. Here, after the target semantic map is obtained, the point clouds may be labeled to distinguish different instances.
Exemplarily, FIG. 5 shows an architecture diagram of semantic segmentation and instance segmentation according to an embodiment of the disclosure. Here, for an input point cloud, an RGB feature is taken as input, and the 3D UNet structure is used for point-wise feature learning. The learned feature is decoded into various representations through a fully connected layer, which may be used for 3D instance segmentation. As shown in FIG. 5, 3D geometric information is taken as input, and a different representation of each input voxel may be obtained by performing point-wise feature learning using the 3D UNet structure. The representation includes the semantic information, the feature embedding information, the spatial embedding information and the occupancy information, where the semantic information aims to assign the class labels, the feature embedding information and the spatial embedding information aim to fuse feature and space information, and the occupancy information aims to indicate the probability that there is actually an object in the input voxel. Then, the covariance estimation is performed on the feature embedding information and the spatial embedding information, which aims to learn an embedding vector that considers both the feature embedding and the spatial embedding and is used for the instance segmentation. After the combination of the feature and spatial embedding is obtained, the occupancy information obtained in advance also needs to be weighted and fused, so as to determine a weight value (expressed by w i, j); the larger w i, j is, the more likely i and j belong to the same instance. In addition, image segmentation and clustering also need to be performed on the 3D geometric information to obtain supervoxel grouping. Then, an initial graph may be defined according to the supervoxels. Here, an embedding feature of a supervoxel is obtained by averaging the sum of the feature embedding and the spatial embedding of the voxels in the supervoxel. The initial graph may be represented by (V, E, W), where the embedding feature of each supervoxel belongs to V, the edge between the supervoxel i and the supervoxel j belongs to E, and the weight value w i, j belongs to W. Next, a final graph may be obtained by performing a graph operation on the initial graph. In the process, specifically, among the edges in E, the edge with the maximum w i, j is selected, which means that i and j are most likely the same instance. A threshold T is set; if the weight value w i, j is greater than T, the two nodes may be merged. After that, the graph is updated continuously until there is no edge of which the weight value is greater than T, at which time the final result may be output.
That is, in FIG. 5, taking the 3D geometric information as input, point-wise predictions of instance-level semantic information are produced. Considering that a 3D metric space provides a more reliable 3D scene perception than projective observation based on a 2D image, 3D occupancy information may be used to represent the number of voxels occupied by each instance. Such occupancy information represents the inherent and essential properties of each 3D instance. The occupancy information is encoded into a traditional 3D instance segmentation pipeline. In an occupancy perception process, both a learning phase and a clustering phase make full use of the characteristics of the occupancy information. In addition, in the learning phase, a color 3D scene may be taken as input, and a hybrid vector of each voxel is extracted using a spatial sparse convolution method. The learning phase not only learns the classical embedding (including the spatial embedding and the feature embedding), but also generates a piece of occupancy information, which implies the volume of an object level. In order to make full use of the semantic information and the geometric information, the feature embedding and the spatial embedding are explicitly supervised with different objectives, and are further combined through the covariance estimation for both a feature embedding distance and a spatial embedding distance. For the clustering phase, the 3D input point cloud is grouped into super-voxels based on the geometric and appearance constraints using a graph-based segmentation algorithm. Then, to merge the super-voxels with similar feature embedding into the same instance, an adaptive threshold is utilized to evaluate the similarity between the embedding distance and the occupancy size. Aided by the reliable comparison between the predicted occupancy size and the clustered occupancy size, the clustering encourages hard samples to be correctly clustered and eliminates the false positives where partial instances are recognized as an independent instance.
It is also to be noted that the TSDF can only provide the 3D grid model. The above content describes how to use the 3D grid model generated by the TSDF to obtain the semantic information and the instance information. The output of the neural network structure here may be divided into two parts, namely the semantic information and the instance information. The semantic information is, for example, table, chair, etc., and the instance information distinguishes individual objects, for example, two separate objects of the same class. After the semantic segmentation and the instance segmentation, information of two chairs and one table may be obtained. In addition, a piece of occupancy information is also provided here, which is defined as the number of voxels occupied by each instance. Based on this, a solution of 3D instance segmentation based on occupancy perception is proposed. The occupancy information indicates the probability that there is actually an object in a voxel. The architecture shown in FIG. 5 not only uses the spatial embedding and the feature embedding, but also considers the occupancy information, so the instance can be more accurately segmented, and then the target semantic map can be rebuilt.
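The clustering step described above (repeatedly merging the pair of supervoxels connected by the largest weight w i, j until no remaining weight exceeds the threshold T) can be sketched with a union-find structure as follows; how the weights of merged nodes are recombined is not specified in the text, so the averaging rule used here is an assumption.

```python
def cluster_supervoxels(num_nodes, edges, threshold):
    """Greedy agglomeration of supervoxels into instance clusters.

    edges: dict mapping (i, j) node-id pairs (i < j) to the affinity weight w_ij.
    The pair with the largest weight is merged first; merging stops once no
    remaining weight exceeds `threshold`.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Union-find lookup with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = dict(edges)
    while edges:
        (i, j), w = max(edges.items(), key=lambda item: item[1])
        if w <= threshold:
            break
        parent[find(j)] = find(i)          # merge node j's cluster into node i's
        del edges[(i, j)]
        # Re-key remaining edges onto cluster representatives; if two edges now
        # connect the same pair of clusters, average their weights (assumption).
        merged = {}
        for (a, b), wab in edges.items():
            ra, rb = sorted((find(a), find(b)))
            if ra == rb:
                continue                   # edge became internal to one cluster
            merged[(ra, rb)] = (merged[(ra, rb)] + wab) / 2.0 if (ra, rb) in merged else wab
        edges = merged

    return [find(x) for x in range(num_nodes)]   # instance label per supervoxel

# Example: three supervoxels, strong affinity between 0 and 1 only.
labels = cluster_supervoxels(3, {(0, 1): 0.9, (1, 2): 0.3}, threshold=0.5)
# labels -> [0, 0, 2]
```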
Further, after the target semantic map is obtained, a potential AR application of the semantic information may also be used, for example, a virtual object is moved to a corresponding position in which the user is interested and displayed. Therefore, in some embodiments, as shown in FIG. 6, after S205, the method may further include the following steps.
At S601, the server obtains target coordinate information of an object in which the user is interested.
At S602, the server sends the target coordinate information to the terminal device.
At S603, the terminal device generates a rendered image according to the target coordinate information and the virtual object, and sends the rendered image to the wearable device, so that the virtual object is displayed at a position corresponding to the target coordinate information through the wearable device.
It is to be noted that the server may obtain the target coordinate information of the object in which the user is interested from the target semantic map, and then sends the target coordinate information to the terminal device, so that the terminal device generates and displays the rendered image including the virtual object.
It is also to be noted that the terminal device may receive the target coordinate information sent by the server based on the TCP connection, generate the rendered image  according to the target coordinate information and the virtual object, and then send the rendered image to the wearable device, so that the virtual object is displayed at a position corresponding to the target coordinate information through the wearable device.
It is to be understood that the obtained instance and semantic information may be used for the AR application (for example, the interaction between the virtual object and the physical environment). Exemplarily, the embodiments of the disclosure may use the semantic information and the instance information to automatically identify the coordinate information of a table and a wall, and then move the virtual object to the corresponding position through a simple command, so as to display the virtual object in the physical environment. FIG. 7 shows an application diagram of an AR application according to an embodiment of the disclosure. As shown in FIG. 7, first, data are collected and processed by the vision enhancement system, and then the 3D grid model is built based on these data, as shown in (a); second, the semantic segmentation and the instance segmentation are performed on the 3D grid model to determine the semantic information and the instance information of the 3D grid model, as shown in (b); third, after the semantic information and the instance information of the model are obtained, interaction with the model is performed, specifically including that: the target coordinate information of the object in which the user is interested is determined, and then the rendered image including the virtual objects is generated according to the target coordinate information and the virtual objects (for example, an oil painting and a display); for example, the objects in which the user is interested are a table and a wall, the coordinate information of the table and the wall is determined by detecting the table and the wall, and then the oil painting is hung on the wall, and the display is placed on the table. (c) shows a result of interaction with this model. In addition, it is also to be noted that the virtual objects, such as the oil painting and the display, are simulated. For example, some software applications provide some virtual objects that may be displayed in a real environment during man-machine interaction.
In this way, after generating the rendered image, the terminal device may send the rendered image to the AR glasses. Specifically, when the semantic information and the instance information are available, the virtual object can be displayed on the physical object of interest through a monitor or the AR glasses connected to the server. The target coordinate information of the physical object of interest is sent to the terminal device, the rendered image in which the virtual object is displayed at the target coordinate information is generated on the terminal device, and the generated rendered image is sent to the AR glasses, so that the user can visualize the virtual object in the physical environment on the AR glasses.
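The following is a minimal sketch of how a terminal device might anchor a virtual object at the received target coordinate information before rasterizing the scene for the AR glasses; the helper name model_matrix_at and the example coordinates are illustrative assumptions, not part of the disclosure.

```python
# Minimal sketch (an assumption, not the patented renderer): build a 4x4 model
# matrix that places a virtual object at the target coordinate information; the
# terminal device would then render the scene with this transform and forward
# the resulting image to the AR glasses over the wired connection.
import numpy as np


def model_matrix_at(target_xyz, scale: float = 1.0) -> np.ndarray:
    """Return a 4x4 transform anchoring a virtual object (e.g. an oil painting) at target_xyz."""
    m = np.eye(4, dtype=np.float32)
    m[:3, :3] *= scale                                    # uniform scale of the virtual object
    m[:3, 3] = np.asarray(target_xyz, dtype=np.float32)   # translation to the target point
    return m


# Usage with hypothetical coordinates of a detected wall point:
wall_anchor = model_matrix_at([1.2, 0.4, 2.5])
```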
The embodiments of the disclosure provide a method for semantic map building, which may include that: at the terminal device side, the collected data of the wearable device is obtained, the collected data including the first image data and the second image data; the second  image data is processed to generate the pose data; the first image data and the pose data are sent to the server; at the server side, the 3D grid model is generated according to the first image data and the pose data; and the semantic segmentation is performed on the 3D grid model to obtain the target semantic map which is used for displaying the virtual object in the physical environment. In this way, the pose data is also used in building of the target semantic map, and thus a more complete scene reconstruction can be achieved, the accuracy of the whole semantic map is improved, and a virtual object can be displayed on a specific physical object better. Additionally, since the scene reconstruction and the semantic segmentation are performed in the server, it can reduce memory usage and power consumption of the terminal device.
Based on the same inventive concept as the above embodiments, FIG. 8 shows a detailed flowchart of a method for semantic map building according to an embodiment of the disclosure. As shown in FIG. 8, taking that the wearable device is the AR glasses as an example, the detailed process may include the following steps.
At S801, the AR glasses acquire the RGBD image data, the fisheye image data and the IMU data.
At S802, the terminal device obtains the RGBD image data, the fisheye image data and the IMU data, and uses the SLAM system to generate the pose data according to the fisheye image data and the IMU data.
At S803, the server obtains the RGBD image data and the pose data, and generates the 3D grid model.
At S804, the server performs the semantic segmentation on the 3D grid model to obtain the target semantic map.
At S805, the server displays an interaction result on a monitor according to the interaction between the target semantic map and the virtual object.
At S806, the terminal device obtains the coordinate information of the object of interest and generates the rendered image in which the virtual object is displayed at the position corresponding to the coordinate information.
At S807, the AR glasses display the rendered image.
It is to be noted that the execution body of S801 and S807 is the AR glasses, the execution body of S802 and S806 is the terminal device, and the execution body of S803, S804 and S805 is the server.
It is also to be noted that, in the embodiments of the disclosure, the AR glasses and the terminal device are in a wired connection through the physical cable, and the RGBD image data, the fisheye image data and the IMU data acquired by the AR glasses may be sent to the terminal device. The RGBD image data here may include the RGB image data and the depth image data. In addition, the rendered image generated by the terminal device may also be sent to the AR glasses.
It is also to be noted that, in the embodiments of the disclosure, the terminal device and the server are in a wireless connection through the TCP, and the RGBD image data and the pose data may be sent to the server. In addition, the coordinate information of the object in which the user is interested, which is determined by the server, may also be sent to the terminal device.
That is, in the embodiments of the disclosure, the terminal device obtains the RGB image data, the depth image data, the fisheye image data, the IMU data and their time stamps from the AR glasses through different threads. The pose data is generated through the SLAM system, which uses the fisheye image data and the IMU data to obtain reliable 6-DoF pose data. Then, the RGB image data, the depth image data and the pose data which are synchronous are sent to the server through the TCP connection. The server rebuilds the 3D grid model according to the obtained pose data and RGBD image data. After rebuilding the 3D grid model, the server uses information such as the RGB colors and the grid geometry to run efficient semantic segmentation.
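As a rough illustration of this data path, the following sketch packs one time-synchronized RGB frame, depth frame and 6-DoF pose into a length-prefixed message and streams it over TCP; the message layout and field types are assumptions made for the example, not the actual protocol of the disclosure.

```python
# Illustrative sketch (assumed message layout, not the actual protocol): the
# terminal device serializes one synchronized sample (RGB + depth + 4x4 pose)
# with a timestamp and sends it to the server as a length-prefixed TCP message.
import socket
import struct

import numpy as np


def pack_frame(rgb: np.ndarray, depth: np.ndarray, pose: np.ndarray, timestamp: float) -> bytes:
    """Serialize one synchronized sample; pose is assumed to be a 4x4 camera-to-world matrix."""
    rgb_bytes = rgb.astype(np.uint8).tobytes()
    depth_bytes = depth.astype(np.uint16).tobytes()
    pose_bytes = pose.astype(np.float32).tobytes()
    header = struct.pack("!dIII", timestamp, len(rgb_bytes), len(depth_bytes), len(pose_bytes))
    return header + rgb_bytes + depth_bytes + pose_bytes


def send_frame(sock: socket.socket, payload: bytes) -> None:
    """Length-prefix the payload so the server can split the TCP byte stream back into frames."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)
```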
Here, the semantic information may be used for man-machine interaction in different applications (for example, displaying the virtual object on a table or a wall). The technical solution integrates data acquisition of the AR glasses, simultaneous 3D reconstruction, simultaneous semantic segmentation and instance segmentation into a whole system. After the data is acquired through the network, the data of the AR glasses may be reconstructed online to achieve simultaneous semantic segmentation and instance segmentation.
Briefly, in the embodiments of the disclosure, the system architecture may use the terminal device with the AR glasses to provide a large-scale simultaneous semantic model. The TCP communication sends synchronous data to the server efficiently. A semantic 3D model is built in real time based on the RGBD data, a visual-inertial SLAM system and the server. In this way, a real scene model including the semantic information may be captured in real time by using a terminal device with the AR glasses. In addition, when related data is transmitted using the TCP communication, scene reconstruction is not limited to a small scale. With the powerful server, the semantic segmentation may be realized in real time, and the 3D reconstruction may be scaled up; besides, the reconstructed semantic 3D model may be applied to virtual object display, semantic building model reconstruction, and other fields.
In addition, in the embodiments of the disclosure, reconstruction may also be performed directly on the terminal device, and a reconstruction result is then sent to the server for semantic segmentation; however, the transmission of the 3D model will be much slower than the transmission of the camera data. The semantic segmentation may also run on the server offline; in this case, the user needs to wait for the semantic segmentation result of a scene, and the scene cannot grow dynamically. In addition, the whole semantic 3D model may also be sent back to the terminal device for different types of applications, but the transmission speed of the 3D model is low. Furthermore, setting aside power consumption, time cost and memory usage, the optimal method would be to process all content on the terminal device rather than on the server; however, the terminal device does not have sufficient computing resources, which results in time delays and exhaustion of the memory and battery.
The embodiments of the disclosure provide a method for semantic map building. The specific implementation of the above embodiments is described in detail through the above embodiment. It can be seen that, according to the technical solution of the above embodiment, the pose data is also used in the building of target semantic map, so that a more complete scene reconstruction can be achieved, the accuracy of the whole semantic map is improved, and the virtual object can be displayed on the specific physical object better. Additionally, since the scene reconstruction and the semantic segmentation are performed in the server, it can reduce memory usage and power consumption of the terminal device.
Based on the same inventive concept as the above embodiments, FIG. 9 shows a composition structure diagram of a server 90 according to an embodiment of the disclosure. As shown in FIG. 9, the server 90 may include: a first receiving unit 901, a modeling unit 902 and a segmenting unit 903.
The first receiving unit 901 is configured to receive the first image data and the pose data sent by the terminal device.
The modeling unit 902 is configured to generate the 3D grid model according to the first image data and the pose data.
The segmenting unit 903 is configured to perform the semantic segmentation on the 3D grid model to obtain the target semantic map, the target semantic map being used for displaying the virtual object in the physical environment.
In some embodiments, the first receiving unit 901 is specifically configured to establish a TCP connection with the terminal device, and receive the first image data and the pose data sent by the terminal device based on the TCP connection.
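For illustration only, the sketch below mirrors the sender shown earlier: the first receiving unit reads length-prefixed frames from the TCP connection and recovers the timestamp, RGB image, depth image and pose; the frame shapes and message layout are assumptions made for this example.

```python
# Complementary sketch (assumed message layout mirroring the sender above): read
# one length-prefixed frame from the TCP connection and recover the timestamp,
# RGB image, depth image and 4x4 pose for the modeling unit.
import socket
import struct

import numpy as np


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from the socket or raise if the peer disconnects."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("TCP connection closed by the terminal device")
        buf += chunk
    return buf


def recv_frame(sock: socket.socket, rgb_shape=(480, 640, 3), depth_shape=(480, 640)):
    """Return (timestamp, rgb, depth, pose) for one received frame."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    payload = recv_exact(sock, length)
    timestamp, rgb_len, depth_len, pose_len = struct.unpack("!dIII", payload[:20])
    offset = 20
    rgb = np.frombuffer(payload[offset:offset + rgb_len], np.uint8).reshape(rgb_shape)
    offset += rgb_len
    depth = np.frombuffer(payload[offset:offset + depth_len], np.uint16).reshape(depth_shape)
    offset += depth_len
    pose = np.frombuffer(payload[offset:offset + pose_len], np.float32).reshape(4, 4)
    return timestamp, rgb, depth, pose
```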
In some embodiments, as shown in FIG. 9, the server 90 may also include a fusing unit 904, configured to determine the local point clouds corresponding to different observation points of the depth camera based on the first image data and the pose data, and perform the fusion calculation on the local point clouds corresponding to different observation points to determine the fusion value of at least one voxel in the 3D space.
The modeling unit 902 is specifically configured to build the 3D grid model based on the fusion value of at least one voxel.
In some embodiments, as shown in FIG. 9, the server 90 may further include a voxelizing unit 905, configured to build a geometry, and voxelize the geometry to obtain the 3D space including at least one voxel.
Correspondingly, the fusing unit 904 is further configured to perform the fusion calculation on the local point clouds corresponding to different observation points by using the TSDF algorithm to obtain the TSDF value of at least one voxel.
The modeling unit 902 is further configured to build the 3D grid model according to the TSDF value of at least one voxel.
In some embodiments, the fusing unit 904 is specifically configured to: determine the first TSDF value and the first weight value obtained at a current observation point, and obtain the fusion TSDF value and the fusion weight value obtained after fusion with a previous observation point; perform the weighted average calculation by using the TSDF algorithm according to the fusion TSDF value, the fusion weight value, the first TSDF value and the first weight value to obtain the second TSDF value and the second weight value corresponding to the first voxel; and update the fusion TSDF value and the fusion weight value according to the second TSDF value and the second weight value, so as to fuse the local point cloud corresponding to the current observation point into the TSDF model. The first voxel is any one of at least one voxel in the 3D space.
In some embodiments, the modeling unit 902 is specifically configured to, if the TSDF value is a positive value, determine that the voxel is between the depth camera and a surface of the object, and if the TSDF value is a negative value, determine that the voxel is outside a line between the depth camera and the surface of the object.
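A compact sketch of this per-voxel update is given below: the TSDF value observed at the current observation point is fused with the previously accumulated value by a weighted average, and the sign of the truncated distance follows the convention stated above; the truncation distance and the unit observation weight are assumed values for illustration.

```python
# Sketch of the per-voxel weighted-average TSDF update (standard TSDF fusion;
# the truncation distance and weighting scheme here are assumptions, and the
# disclosure's exact parameters may differ).
import numpy as np


def tsdf_from_depth(voxel_depth: float, surface_depth: float, trunc: float = 0.05) -> float:
    """Truncated signed distance along a camera ray: positive between the depth
    camera and the object surface, negative beyond the surface."""
    return float(np.clip(surface_depth - voxel_depth, -trunc, trunc)) / trunc


def fuse_voxel(fused_tsdf: float, fused_weight: float,
               tsdf_obs: float, weight_obs: float = 1.0):
    """Fuse the first TSDF value observed at the current observation point into the
    voxel, returning the second (updated) TSDF value and weight."""
    new_weight = fused_weight + weight_obs
    new_tsdf = (fused_tsdf * fused_weight + tsdf_obs * weight_obs) / new_weight
    return new_tsdf, new_weight
```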
In some embodiments, the first image data includes the depth image data and the RGB image data. The depth image data, the RGB image data and the pose data are synchronous in time.
In some embodiments, the segmenting unit 903 is further configured to input the 3D grid model to the neural network structure, and in the neural network structure, perform the semantic segmentation and the instance segmentation by way of point-wise feature learning to obtain the target semantic map.
In some embodiments, the segmenting unit 903 is specifically configured to: perform the 3D sparse convolution operation on the 3D grid model to determine the semantic information, the feature embedding information, the spatial embedding information and the occupancy information; perform the supervoxel grouping on the 3D grid model by using the image segmentation algorithm, so as to obtain the supervoxel information; perform the covariance estimation according to the feature embedding information and the spatial embedding information to obtain the target embedding information; perform the clustering operation according to the target embedding information, the occupancy information and the supervoxel information to determine the instance information; and obtain the target semantic map according to the semantic information and the instance information.
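As a rough, simplified illustration of the clustering stage only (the 3D sparse convolution backbone and the supervoxel algorithm are not implemented here), the sketch below averages assumed per-point embeddings within each supervoxel and greedily merges nearby supervoxels into instances; the merge threshold and the occupancy weighting are illustrative assumptions.

```python
# Simplified sketch of the instance-clustering stage: per-point embeddings
# (assumed to come from a 3D sparse-convolution backbone, not implemented here)
# are averaged per supervoxel, and supervoxels with close mean embeddings are
# greedily merged into instances.
import numpy as np


def cluster_instances(embeddings: np.ndarray, supervoxel_ids: np.ndarray,
                      occupancy: np.ndarray, merge_threshold: float = 0.5) -> np.ndarray:
    """Return an instance id per point. embeddings: (N, D); supervoxel_ids, occupancy: (N,)."""
    sv_ids = np.unique(supervoxel_ids)
    # Mean embedding per supervoxel, weighted by the predicted occupancy information.
    sv_embed = np.stack([
        np.average(embeddings[supervoxel_ids == sv], axis=0,
                   weights=occupancy[supervoxel_ids == sv] + 1e-6)
        for sv in sv_ids
    ])
    instance_of_sv = -np.ones(len(sv_ids), dtype=int)
    next_instance = 0
    for i in range(len(sv_ids)):
        if instance_of_sv[i] >= 0:
            continue
        instance_of_sv[i] = next_instance
        dists = np.linalg.norm(sv_embed - sv_embed[i], axis=1)
        instance_of_sv[(dists < merge_threshold) & (instance_of_sv < 0)] = next_instance
        next_instance += 1
    # Map the supervoxel-level instance ids back to the points.
    lookup = {int(sv): int(inst) for sv, inst in zip(sv_ids, instance_of_sv)}
    return np.array([lookup[int(sv)] for sv in supervoxel_ids])
```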
In some embodiments, as shown in FIG. 9, the server 90 may further include a first sending unit 906, configured to obtain the target coordinate information of the object in which the user is interested from the target semantic map, and send the target coordinate information to the terminal device, the target coordinate information being used for the terminal device to generate and display the rendered image including the virtual object.
It can be understood that, in the embodiments of the disclosure, the "unit" may be a part of a circuit, a part of a processor, a part of a program or software, etc.; of course, it may also be a module, or it may be non-modular. Moreover, all parts in the present embodiment may be integrated in one processing unit, or each unit may exist separately and physically, or two or more units may be integrated in one unit. The integrated unit may be realized in the form of hardware or in the form of a software function module.
If the integrated unit is implemented by software function modules and the software function modules are sold or used as independent products, they may also be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present embodiment substantially, or the part making a contribution to the traditional art, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes a number of instructions configured to enable a computer device (which can be a personal computer, a server, a network device or the like) or a processor to perform all or part of the steps of the method in the embodiments. The foregoing storage medium includes any medium that can store program code, such as a U disk, a removable hard disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disc.
Therefore, the embodiments of the disclosure provide a computer storage medium, which is applied to the server 90. The computer storage medium stores a computer program. The computer program, when executed by a first processor, implements the method described in any above embodiment.
Based on the composition of the server 90 and the computer storage medium, FIG. 10 shows a structure diagram of specific hardware of a server 90 according to an embodiment of the disclosure. As shown in FIG. 10, the server 90 may include: a first communication interface 1001, a first memory 1002 and a first processor 1003. The components are coupled together via a first bus system 1004. It may be understood that the first bus system 1004 is configured to implement connection communication among these components. The first bus system 1004 includes a data bus and further includes a power bus, a control bus and a state signal bus. However, for clear description, various buses in FIG. 10 are marked as the first bus system 1004.
The first communication interface 1001 is configured to receive and send a signal in the process of receiving and sending messages with other external network elements.
The first memory 1002 is configured to store a computer program capable of  running on the first processor 1003.
When running the computer program, the first processor 1003 is configured to: receive the first image data and the pose data sent by the terminal device; generate the 3D grid model according to the first image data and the pose data; and perform the semantic segmentation on the 3D grid model to obtain the target semantic map, the target semantic map being used for displaying the virtual object in the physical environment.
It can be understood that the first memory 1002 in the embodiment of the application may be a volatile memory or a nonvolatile memory, or may include both the volatile and nonvolatile memories. The nonvolatile memory may be a ROM, a PROM, an Erasable PROM (EPROM), an EEPROM or a flash memory. The volatile memory may be a RAM, and is used as an external high-speed cache. It is exemplarily but not restrictively described that RAMs in various forms may be adopted, such as a Static RAM (SRAM), a Dynamic RAM (DRAM), a Synchronous DRAM (SDRAM), a Double Data Rate SDRAM (DDRSDRAM), an Enhanced SDRAM (ESDRAM), a Synchlink DRAM (SLDRAM) and a Direct Rambus RAM (DR RAM). The first memory 1002 of the system and method described in the application is intended to include, but is not limited to, memories of these and any other proper types.
The first processor 1003 may be an integrated circuit chip with a signal processing capability. In an implementation process, the steps of the method may be accomplished by an integrated logic circuit of hardware in the first processor 1003 or an instruction in a software form. The first processor 1003 may be a universal processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. Each method, step and logical block diagram disclosed in the embodiments of the disclosure may be implemented or executed. The universal processor may be a microprocessor, or the processor may also be any conventional processor and the like. The steps of the method disclosed in combination with the embodiments of the disclosure may be directly embodied to be executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor. The software module may be located in a mature storage medium in this field, such as a RAM, a flash memory, a ROM, a PROM or an EEPROM, and a register. The storage medium is located in the first memory 1002. The first processor 1003 reads information from the first memory 1002 and completes the steps of the above method in combination with the hardware of the processor.
It is to be understood that these embodiments described here may be implemented by hardware, software, firmware, middleware, microcode, or a combination thereof. For hardware implementation, the processing unit may be realized in one or more of an ASIC, a DSP,  a DSP Device (DSPD) , a Programmable Logic Device (PLD) , an FPGA, a universal processor, a controller, a micro-controller, a microprocessor, other electronic units for implementing the functions of the application or a combination thereof. For software implementation, the technology described in the specification can be implemented through modules (such as procedures and functions) that perform the functions described in the application. A software code can be stored in the memory and executed by the processor. The memory can be implemented in or outside the processor.
Optionally, as another embodiment, the first processor 1003 is further configured to perform, when running the computer program, the method described in any above embodiment.
The embodiments of the disclosure provide a server, which may include: a first receiving unit, a modeling unit and a segmenting unit. In this way, the pose data is also used in building of the target semantic map, thus a more complete scene reconstruction can be achieved, the accuracy of the whole semantic map is improved, and a virtual object can be displayed on a specific physical object better. Additionally, since the scene reconstruction and the semantic segmentation are performed in the server, it can reduce memory usage and power consumption of the terminal device.
Based on the same inventive concept as the above embodiments, FIG. 11 shows a composition structure diagram of a terminal device 110 according to an embodiment of the disclosure. As shown in FIG. 11, the terminal device 110 may include: an obtaining unit 1101, a data processing unit 1102 and a second sending unit 1103.
The obtaining unit 1101 is configured to obtain the collected data of the wearable device, the collected data including the first image data and the second image data.
The data processing unit 1102 is configured to process the second image data to generate the pose data.
The second sending unit 1103 is configured to send the first image data and the pose data to the server, the first image data and the pose data being used for the server to build the target semantic map.
In some embodiments, the first image data includes the depth image data and the RGB image data, and the second image data includes the fisheye image data and the inertial sensor data. Correspondingly, the obtaining unit 1101 is specifically configured to obtain the depth image data and time stamp information corresponding to the depth image data through the first thread, obtain the RGB image data and time stamp information corresponding to the RGB image data through the second thread, obtain the fisheye image data and time stamp information corresponding to the fisheye image data through the third thread, and obtain the inertial sensor data and time stamp information corresponding to the inertial sensor data through the fourth thread. The time stamp information is used for measuring whether the depth image data, the RGB image data, the fisheye image data and the inertial sensor data are synchronous in time.
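The per-thread acquisition described above can be sketched as follows; the reader callables, the queue-based hand-off and the use of a monotonic clock are assumptions for illustration rather than the device's actual API.

```python
# Sketch (hypothetical device API): each data stream is read on its own thread,
# and every sample is stored together with its time stamp so that the streams
# can later be checked for synchronization.
import queue
import threading
import time


def acquisition_thread(read_sample, out_queue: queue.Queue, stop: threading.Event) -> None:
    """Generic worker; read_sample is a hypothetical callable returning one frame or IMU packet."""
    while not stop.is_set():
        sample = read_sample()
        out_queue.put((time.monotonic(), sample))


def start_streams(readers: dict):
    """readers maps stream names ('depth', 'rgb', 'fisheye', 'imu') to reader callables."""
    stop = threading.Event()
    queues = {name: queue.Queue() for name in readers}
    for name, fn in readers.items():
        threading.Thread(target=acquisition_thread, args=(fn, queues[name], stop),
                         daemon=True).start()
    return queues, stop
```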
In some embodiments, the data processing unit 1102 is specifically configured to perform the pose calculation on the fisheye image data and the inertial sensor data by using the SLAM system, so as to generate the pose data.
In some embodiments, as shown in FIG. 11, the terminal device 110 may also include a synchronizing unit 1104, configured to: determine first time stamp information corresponding to the pose data, select the depth image data in synchronization with the first time stamp information from the depth image data and the time stamp information corresponding to the depth image data, select the RGB image data in synchronization with the first time stamp information from the RGB image data and the time stamp information corresponding to the RGB image data, and determine the first image data according to the selected depth image data and the selected RGB image data. The first image data and the pose data are synchronous in time.
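A minimal sketch of this selection step is shown below: for the time stamp of a pose sample, the closest time-stamped depth or RGB frame within a tolerance is chosen; the tolerance value is an assumed parameter.

```python
# Sketch of the synchronization step: for the first time stamp information of a
# pose sample, pick the stored frame whose time stamp is closest, and accept it
# only if the difference is within an assumed tolerance (in seconds).
def select_synchronized(pose_ts: float, stamped_frames: list, tolerance: float = 0.02):
    """stamped_frames is a list of (timestamp, frame); returns the closest frame or None."""
    if not stamped_frames:
        return None
    ts, frame = min(stamped_frames, key=lambda item: abs(item[0] - pose_ts))
    return frame if abs(ts - pose_ts) <= tolerance else None
```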
In some embodiments, the second sending unit 1103 is specifically configured to establish the TCP connection with the server, and send the first image data and the pose data to the server based on the TCP connection.
In some embodiments, as shown in FIG. 11, the terminal device 110 may further include a second receiving unit 1105, configured to receive the target coordinate information sent by the server based on the TCP connection.
The data processing unit 1102 is further configured to generate the rendered image according to the target coordinate information and the virtual object, and send the rendered image to the wearable device, so that the virtual object is displayed at a position corresponding to the target coordinate information through the wearable device.
It can be understood that, in the embodiments of the disclosure, the "unit" may be a part of a circuit, a part of a processor, a part of a program or software, etc.; of course, it may also be a module, or it may be non-modular. Moreover, all parts in the present embodiment may be integrated in one processing unit, or each unit may exist separately and physically, or two or more units may be integrated in one unit. The integrated unit may be realized in the form of hardware or in the form of a software function module.
If the integrated unit is implemented by software function modules, and the software function modules are sold or used as independent products, they can also be stored in a computer readable storage medium. Based on this understanding, the embodiments of the disclosure provide a computer storage medium, which is applied to the terminal device 110. The computer storage medium stores a computer program. The computer program, when executed by a second processor, implements the method described in any above embodiment.
Based on the composition of the terminal device 110 and the computer storage  medium, FIG. 12 shows a structure diagram of specific hardware of a terminal device 110 according to an embodiment of the disclosure. As shown in FIG. 12, the terminal device 110 may include: a second communication interface 1201, a second memory 1202 and a second processor 1203. The components are coupled together via a second bus system 1204. It may be understood that the second bus system 1204 is configured to implement connection communication among these components. The second bus system 1204 includes a data bus and further includes a power bus, a control bus and a state signal bus. However, for clear description, various buses in FIG. 12 are marked as the second bus system 1204.
The second communication interface 1201 is configured to receive and send a signal in the process of receiving and sending messages with other external network elements.
The second memory 1202 is configured to store a computer program capable of running on the second processor 1203.
When running the computer program, the second processor 1203 is configured to: obtain the collected data of the wearable device, the collected data including the first image data and the second image data; process the second image data to generate the pose data; and send the first image data and the pose data to the server, the first image data and the pose data being used for the server to build the target semantic map.
Optionally, as another embodiment, the second processor 1203 is further configured to perform, when running the computer program, the method described in any above embodiment.
It can be understood that, the hardware functions of the second memory 1202 and the first memory 1002 are similar, and the hardware functions of the second processor 1203 and the first processor 1003 are similar. Elaborations are omitted herein.
The embodiments of the disclosure provide a terminal device, which may include a data processing unit and a second sending unit. In this way, the pose data is also used in building of the target semantic map, thus a more complete scene reconstruction can be achieved, the accuracy of the whole semantic map is improved, and a virtual object can be displayed on a specific physical object better. Additionally, since the scene reconstruction and the semantic segmentation are performed in the server, it can reduce memory usage and power consumption of the terminal device.
It is to be noted that the terms "include" and "contain" or any other variant thereof are intended to cover nonexclusive inclusions herein, so that a process, method, object or device including a series of components not only includes those components but also includes other components which are not clearly listed, or further includes components intrinsic to the process, the method, the object or the device. Under the condition of no more limitations, a component defined by the statement "including a/an......" does not exclude the existence of other same components in a process, method, object or device including the component.
The sequence numbers of the embodiments of the disclosure are just for describing, instead of representing superiority-inferiority of the embodiments.
The methods disclosed in some method embodiments provided in the application may be freely combined without conflicts to obtain new method embodiments.
The characteristics disclosed in some product embodiments provided in the application may be freely combined without conflicts to obtain new product embodiments.
The characteristics disclosed in some method or device embodiments provided in the application may be freely combined without conflicts to obtain new method embodiments or device embodiments.
The above is only the specific implementation manner of the application and not intended to limit the scope of protection of the application. Any variations or replacements apparent to those skilled in the art within the technical scope disclosed by the application shall fall within the scope of protection of the application. Therefore, the scope of protection of the application shall be subject to the scope of protection of the claims.
INDUSTRIAL APPLICABILITY
In the embodiments of the disclosure, at a terminal device side, collected data of a wearable device is obtained, the collected data including first image data and second image data; the second image data is processed to generate pose data; and the first image data and the pose data are sent to a server, so that the server builds a target semantic map. At the server side, the first image data and the pose data sent by the terminal device are received; a 3D grid model is generated according to the first image data and the pose data; and semantic segmentation is performed on the 3D grid model to obtain a target semantic map, the target semantic map being used for display of a virtual object in a physical environment. In this way, the pose data is also used in building of the target semantic map, thus a more complete scene reconstruction can be achieved, the accuracy of the whole semantic map is improved, and a virtual object can be displayed on a specific physical object better. Additionally, since the scene reconstruction and the semantic segmentation are performed in the server, it can reduce memory usage and power consumption of the terminal device.

Claims (19)

  1. A method for semantic map building, applied to a server, the method comprising:
    receiving first image data and pose data sent by a terminal device;
    generating a Three Dimension (3D) grid model according to the first image data and the pose data; and
    performing semantic segmentation on the 3D grid model to obtain a target semantic map, wherein the target semantic map is used for displaying a virtual object in a physical environment.
  2. The method of claim 1, wherein generating the 3D grid model according to the first image data and the pose data comprises:
    determining local point clouds corresponding to different observation points of a depth camera based on the first image data and the pose data;
    performing fusion calculation on the local point clouds corresponding to different observation points to determine a fusion value of at least one voxel in 3D space; and
    building the 3D grid model based on the fusion value of at least one voxel.
  3. The method of claim 2, further comprising:
    building a geometry; and
    voxelizing the geometry to obtain the 3D space comprising at least one voxel;
    correspondingly, wherein performing fusion calculation on the local point clouds corresponding to different observation points to determine the fusion value of at least one voxel in the 3D space comprises:
    performing, by using a Truncated Signed Distance Function (TSDF) algorithm, fusion calculation on the local point clouds corresponding to different observation points based on the 3D space, to obtain a TSDF value of at least one voxel; and
    wherein building the 3D grid model based on the fusion value of at least one voxel comprises:
    building the 3D grid model according to the TSDF value of at least one voxel.
  4. The method of claim 3, wherein performing, by using the TSDF algorithm, fusion calculation on the local point clouds corresponding to different observation points to obtain the TSDF value of at least one voxel, comprises:
    determining, based on a first voxel, a first TSDF value and a first weight value obtained at a current observation point, and obtaining a fusion TSDF value and a fusion weight value obtained after fusion with a previous observation point;
    performing, by using the TSDF algorithm, weighted average calculation according to the fusion TSDF value, the fusion weight value, the first TSDF value and the first weight value, to obtain a second TSDF value and a second weight value corresponding to the first voxel; and
    updating the fusion TSDF value and the fusion weight value according to the second  TSDF value and the second weight value, to fuse the local point cloud corresponding to the current observation point into the TSDF model, wherein the first voxel is any one of at least one voxel in the 3D space.
  5. The method of claim 3, wherein building the 3D grid model according to the TSDF value of at least one voxel comprises:
    when the TSDF value is a positive value, determining that the voxel is between the depth camera and a surface of the object; and
    when the TSDF value is a negative value, determining that the voxel is outside a line between the depth camera and the surface of the object.
  6. The method of claim 1, wherein the first image data comprises depth image data and RGB image data; wherein, the depth image data, the RGB image data and the pose data are synchronous in time.
  7. The method of claim 1, wherein performing semantic segmentation on the 3D grid model to obtain the target semantic map comprises:
    inputting the 3D grid model to a neural network structure;
    in the neural network structure, performing semantic segmentation and instance segmentation by way of point-wise feature learning to obtain the target semantic map.
  8. The method of claim 7, wherein performing semantic segmentation and instance segmentation by way of point-wise feature learning to obtain the target semantic map comprises:
    performing a 3D sparse convolution operation on the 3D grid model to determine semantic information, feature embedding information, spatial embedding information and occupancy information;
    performing, by using an image segmentation algorithm, supervoxel grouping on the 3D grid model, so as to obtain supervoxel information;
    performing covariance estimation according to the feature embedding information and the spatial embedding information to obtain target embedding information;
    performing a clustering operation according to the target embedding information, the occupancy information and the supervoxel information, to determine instance information; and
    obtaining the target semantic map according to the semantic information and the instance information.
  9. The method of any one of claims 1 to 8, wherein after obtaining the target semantic map, the method further comprises:
    obtaining target coordinate information of an object in which a user is interested from the target semantic map; and
    sending the target coordinate information to the terminal device, wherein the target coordinate information is used for the terminal device to generate and display a rendered image  comprising the virtual object.
  10. A method for semantic map building, applied to a terminal device, the method comprising:
    obtaining collected data of a wearable device; wherein, the collected data comprises first image data and second image data;
    processing the second image data to generate pose data; and
    sending the first image data and the pose data to a server, wherein the first image data and the pose data are used for the server to build a target semantic map.
  11. The method of claim 10, wherein the first image data comprises depth image data and RGB image data, and the second image data comprises fisheye image data and inertial sensor data;
    correspondingly, wherein obtaining the collected data of the wearable device comprises:
    obtaining the depth image data and time stamp information corresponding to the depth image data through a first thread;
    obtaining the RGB image data and time stamp information corresponding to the RGB image data through a second thread;
    obtaining the fisheye image data and time stamp information corresponding to the fisheye image data through a third thread; and
    obtaining the inertial sensor data and time stamp information corresponding to the inertial sensor data through a fourth thread;
    wherein, the time stamp information is used for measuring whether the depth image data, the RGB image data, the fisheye image data and the inertial sensor data are synchronous in time.
  12. The method of claim 11, wherein processing the second image data to generate the pose data comprises:
    performing, by using a Simultaneous Localization And Mapping (SLAM) system, pose calculation on the fisheye image data and the inertial sensor data, so as to generate the pose data.
  13. The method of claim 12, wherein before sending the first image data and the pose data to the server, the method further comprises:
    determining first time stamp information corresponding to the pose data;
    selecting depth image data in synchronization with the first time stamp information from the depth image data and the time stamp information corresponding to the depth image data;
    selecting RGB image data in synchronization with the first time stamp information from the RGB image data and the time stamp information corresponding to the RGB image data; and
    determining the first image data according to the selected depth image data and the selected RGB image data; wherein, the first image data and the pose data are synchronous in time.
  14. The method of any one of claims 10 to 13, further comprising:
    receiving target coordinate information sent by the server;
    generating a rendered image according to the target coordinate information and a virtual object; and
    sending the rendered image to the wearable device, so that the virtual object is displayed at a position corresponding to the target coordinate information through the wearable device.
  15. A server, comprising: a first receiving unit, a modeling unit and a segmenting unit; wherein,
    the first receiving unit is configured to receive first image data and pose data sent by a terminal device;
    the modeling unit is configured to generate a Three Dimension (3D) grid model according to the first image data and the pose data; and
    the segmenting unit is configured to perform semantic segmentation on the 3D grid model to obtain a target semantic map, wherein the target semantic map is used for displaying a virtual object in a physical environment.
  16. A server, comprising: a first memory and a first processor; wherein,
    the first memory is configured to store a computer program capable of running in the first processor;
    the first processor is configured to execute, when running the computer program, the method of any one of claims 1 to 9.
  17. A terminal device, comprising: an obtaining unit, a data processing unit and a second sending unit; wherein,
    the obtaining unit is configured to obtain collected data of a wearable device, wherein the collected data comprises first image data and second image data;
    the data processing unit is configured to process the second image data to generate pose data;
    the second sending unit is configured to send the first image data and the pose data to a server, so that the server builds a target semantic map.
  18. A terminal device, comprising: a second memory and a second processor; wherein,
    the second memory is configured to store a computer program capable of running on the second processor;
    the second processor is configured to execute, when running the computer program, the method of any one of claims 10 to 14.
  19. A computer readable storage medium, in which a computer program is stored, the computer program implements, when executed by a first processor, the method of any one of claims 1 to 9, or implements, when executed by a second processor, the method of any one of claims 10 to 14.