WO2024103708A1 - Positioning method, terminal device, server, and storage medium - Google Patents

Positioning method, terminal device, server, and storage medium

Info

Publication number
WO2024103708A1
Authority
WO
WIPO (PCT)
Prior art keywords
point cloud
cloud map
terminal device
local point
server
Prior art date
Application number
PCT/CN2023/100387
Other languages
French (fr)
Chinese (zh)
Inventor
施文哲
陆平
乔秀全
黄亚坤
Original Assignee
中兴通讯股份有限公司
Priority date
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2024103708A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00: Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 17/05: Geographic models
    • G06T 7/00: Image analysis
    • G06T 7/10: Segmentation; Edge detection
    • G06T 7/70: Determining position or orientation of objects or cameras

Definitions

  • AR: augmented reality
  • a positioning method comprising:
  • the pose information corresponding to the current video frame image is determined based on the current video frame image and the target local point cloud map.
  • a positioning method comprising:
  • the target local point cloud map is sent to the terminal device.
  • a positioning device comprising:
  • the communication unit is used to obtain a target local point cloud map from a server, where the target local point cloud map is a local point cloud map that is closest to an initial position of a terminal device among multiple local point cloud maps.
  • the processing unit is used to determine the posture information corresponding to the current video frame image based on the current video frame image and the target local point cloud map after moving into the coverage range of the target local point cloud map.
  • a positioning device comprising:
  • the processing unit is used to determine the initial position of the terminal device.
  • the processing unit is also used to determine, based on the initial position of the terminal device, from multiple local point cloud maps a target local point cloud map that is closest to the initial position of the terminal device.
  • the communication unit is used to send the target local point cloud map to the terminal device.
  • a terminal device comprising: a processor and a memory; the memory stores instructions executable by the processor; the processor is configured to execute the instructions so that the terminal device implements the method provided in the first aspect above.
  • a server comprising: a processor and a memory; the memory stores instructions executable by the processor; the processor is configured to execute the instructions so that the server implements the method provided in the second aspect above.
  • a computer-readable storage medium which stores computer instructions.
  • When the computer instructions are executed on a computer, the computer executes the method provided in the first aspect, or executes the method provided in the second aspect.
  • a computer program product comprising computer instructions.
  • When the computer instructions are executed on a computer, the computer executes the method provided in the first aspect, or executes the method provided in the second aspect.
  • FIG1 is a schematic diagram of the composition of a positioning system according to some embodiments.
  • FIG2 is a functional schematic diagram of a terminal device and a server according to some embodiments.
  • FIG3 is a schematic flow chart of a positioning method according to some embodiments.
  • FIG4 is a schematic diagram of the composition of a positioning device according to some embodiments.
  • FIG5 is a schematic diagram of another positioning device according to some embodiments.
  • FIG6 is a schematic diagram of the structure of a terminal device according to some embodiments.
  • FIG. 7 is a schematic diagram of the structure of a server according to some embodiments.
  • The terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of the features. In the description of the present disclosure, unless otherwise specified, "plurality" means two or more.
  • The term "connection" should be understood in a broad sense; for example, it can be a fixed connection, a detachable connection, or an integral connection.
  • For those of ordinary skill in the art, the meanings of the above terms in the present disclosure can be understood according to the specific circumstances.
  • When describing a pipeline, the terms "connected" and "connection" used in the present disclosure have the meaning of conduction. The specific meaning needs to be understood in conjunction with the context.
  • words such as “exemplarily” or “for example” are used to indicate examples, illustrations or descriptions. Any embodiment or design described as “exemplarily” or “for example” in the embodiments of the present disclosure should not be interpreted as being more preferred or more advantageous than other embodiments or designs. Specifically, the use of words such as “exemplarily” or “for example” is intended to present related concepts in a specific way.
  • Web real-time communication (WebRTC) is a real-time communication technology that allows web applications or sites to establish peer-to-peer connections between browsers without an intermediary, enabling the transmission of video streams, audio streams, or any other data. WebRTC therefore enables users to create peer-to-peer data sharing and teleconferencing without installing any plug-ins or third-party software.
  • Structure from motion (SFM) is an algorithm for 3D reconstruction.
  • SFM extracts key points from multiple 2D images and performs image matching to calculate the image pose of the 2D image. It then performs 3D reconstruction based on the image pose and the 2D coordinates of the key points to obtain the corresponding 3D coordinate points.
  • 3D reconstruction technology refers to an image processing technology that constructs the 3D structure of an object or scene in an image through multiple frames of 2D images.
  • 3D reconstruction technology is usually used in augmented reality (AR), mixed reality (MR), visual positioning, autonomous driving, etc.
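  • As an illustrative aside (not part of the original disclosure), the SFM pipeline described above, key-point extraction, matching, relative-pose estimation, and triangulation, can be sketched with standard OpenCV calls. The snippet below is a minimal two-view sketch under assumed image paths and an assumed intrinsics matrix K, not the method of this disclosure.

```python
# Minimal two-view structure-from-motion sketch (assumes OpenCV and NumPy).
# Paths, intrinsics K, and variable names are illustrative only.
import cv2
import numpy as np

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])                        # assumed camera intrinsics

img1 = cv2.imread("frame_0.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_1.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)          # key points + descriptors
kp2, des2 = sift.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des1, des2, k=2)
good = [p[0] for p in matches
        if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]   # Lowe ratio test

pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# Relative camera pose from the essential matrix.
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
_, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)

# Triangulate matched key points into 3D coordinates.
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])
pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
pts3d = (pts4d[:3] / pts4d[3]).T                       # N x 3 reconstructed points
```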
  • SIFT: scale-invariant feature transform.
  • SIFT is a local feature descriptor; the description is scale-invariant and can be used to detect key points in an image.
  • SIFT features are based on some local appearance points of interest on the object and are independent of the size and rotation of the image. SIFT features have a high tolerance for changes in lighting, noise, and slight changes in viewpoint. The detection rate for partially occluded objects using the SIFT feature description is also quite high, and even as few as three SIFT features are sufficient to calculate the position and orientation of an object.
  • AR technology is a technology that integrates virtual information with real information. It achieves "enhancement" of the real world by simulating virtual information such as computer-generated text and images and applying them to the real world.
  • AR navigation based on terminal devices is an important application of AR technology.
  • AR navigation locates the user's current position and determines the navigation path.
  • the navigation path is rendered into the real environment through AR, and real-time navigation information is added to the real road conditions to more intuitively guide the user forward, so that the user can get an immersive navigation experience.
  • In related technologies, the positioning of a terminal device is completed by the server based on the current video frame image and a point cloud map; that is, the terminal device relies on an external device for its own positioning.
  • Existing positioning methods for AR navigation result in poor positioning accuracy of the terminal device in certain scenarios, or in high positioning costs, which limits the scope of application of AR navigation based on terminal devices.
  • For example, the positioning method based on the global positioning system (GPS) has high positioning accuracy when performing AR navigation outdoors.
  • GPS: global positioning system.
  • Indoors, however, buildings weaken the GPS signal strength and reduce positioning accuracy; therefore, the GPS-based positioning method cannot be applied to indoor AR navigation.
  • Another example is the positioning method based on multi-sensor fusion, which performs positioning by fusing data from sensors such as inertial sensors, lidar, and Bluetooth.
  • Although this positioning method also has high positioning accuracy, indoor buildings affect the detection of the sensors during indoor AR navigation, so indoor positioning accuracy is poor; moreover, adding more sensors makes the positioning cost too high.
  • the vision-based positioning method is based on the Simultaneous Localization and Mapping (SLAM) framework, and performs positioning through feature extraction, feature matching, pose solving and other main steps.
  • The vision-based positioning method also has high positioning accuracy, but it has high requirements for computing power, while the computing power of a terminal device is limited; it therefore cannot run on the terminal device, that is, it cannot be applied to AR navigation based on the terminal device.
  • the present disclosure provides a positioning method, which obtains the target local point cloud map closest to the initial position of the terminal device from the server, and then the terminal device can complete the positioning of the terminal device based on the target local point cloud map and the current video frame image.
  • A local point cloud map is small and has low computing power requirements, so a terminal device with limited computing power can complete its own positioning based on the target local point cloud map and the current video frame image, realizing high-precision positioning within the terminal device's limited computing power.
  • FIG1 is a schematic diagram of the composition of a positioning system according to some embodiments.
  • the positioning system may include a terminal device 100 and a server 200.
  • the terminal device 100 and the server 200 may be connected via a wired network or a wireless network.
  • the terminal device 100 may be a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an AR/virtual reality (VR) device, a laptop computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), etc.
  • VR: virtual reality
  • UMPC: ultra-mobile personal computer
  • PDA: personal digital assistant
  • the terminal device 100 may be installed with clients of various application services, such as a browser, an instant messaging tool, etc.
  • the user may operate the client of the application service, and the terminal device 100 interacts with the server 200 through a wired network or a wireless network in response to the user's operation to receive or send information.
  • The browser may be a web browser that supports WebRTC functions and WebAssembly (WASM), which is portable, small, fast-loading, and highly compatible.
  • WASM: WebAssembly
  • Server 200 can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), as well as big data and artificial intelligence platforms.
  • the server 200 can cooperate with the terminal device 100 to enable the user to use various application services.
  • the server 200 can analyze and process the information sent by the terminal device 100 and feed back the processing results to the terminal device 100.
  • the server 200 is a server running a Web browser server
  • the terminal device 100 is a terminal device running a Web browser client
  • the server 200 is a server with a display, and the display of the server is used to display a control interface of the server.
  • the wired network or wireless network may include a router, a switch, a base station, or other devices that facilitate communication between the terminal device 100 and the server 200, which is not limited in the embodiments of the present disclosure.
  • the number of devices included in the positioning system shown in Figure 1 is not limited, for example, the number of terminal devices 100 is not limited, and the number of servers is not limited.
  • the positioning system shown in Figure 1 may also include other devices, which is not limited in this disclosure.
  • FIG. 2 is a functional schematic diagram of a terminal device and a server according to some embodiments. As shown in FIG. 2 , in some embodiments of the present disclosure, the terminal device 100 and the server 200 are respectively used to implement the following functions:
  • After acquiring the multiple frames of environmental images, the server 200 restores their scale and then stores them in a database of the server 200, which may be a database.db database.
  • the server 200 is also used to extract features from multiple frames of environmental images, generate a point cloud map, and store the point cloud map in a 3D point plus descriptor format.
  • the server 200 is also used to implement functions such as outlier removal, point cloud division, point cloud storage structure, point cloud compression, point cloud prediction, and point cloud map storage for the point cloud map.
  • point cloud division is to divide the point cloud map into multiple local point cloud maps based on density or grid.
  • the number of three-dimensional (3D) points and the number of feature descriptors are stored in the point cloud storage structure, wherein the correspondence between 3D points and feature descriptors is one-to-many.
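  • As an illustrative aside (not part of the original disclosure), the following Python sketch shows one possible "3D point plus descriptor" storage structure with the one-to-many relationship described above; the class and field names are hypothetical.

```python
# Hypothetical storage structure for a point cloud map, mirroring the
# "3D point plus descriptor" layout described above (names are illustrative).
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class PointCloudMap:
    points: np.ndarray                                            # (N, 3) float32 3D coordinates
    # One 3D point may have been observed in several environment images,
    # so it can map to several descriptors (one-to-many).
    descriptors: List[np.ndarray] = field(default_factory=list)   # N entries, each (k_i, 32) uint8 ORB

    @property
    def num_points(self) -> int:
        return self.points.shape[0]

    @property
    def num_descriptors(self) -> int:
        return sum(d.shape[0] for d in self.descriptors)
```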
  • Point cloud compression is to remove ceiling and ground points from the point cloud map.
  • Point cloud prediction is to determine the local point cloud map closest to the initial position of the terminal device 100 based on the initial position of the terminal device 100.
  • the terminal device 100 is used to obtain the current video frame image, and send the current video frame image to the server 200 through the Web browser client to obtain the local point cloud map closest to the initial position of the terminal device 100 from the server 200, and store the obtained local point cloud map in the database.
  • Server 200 is a server running a Web browser server, so server 200 is also used for server positioning, that is, the Web browser server completes the positioning of the current position of the terminal device 100 according to the current video frame image sent by the terminal device 100, and stores the initial position of the terminal device 100 in the database.db database.
  • the terminal device 100 is also used to track the real-time position of the user during AR navigation based on the three-axis attitude angle (or angular velocity) and acceleration measured by the inertial measurement unit (IMU) in combination with the current video frame image, the previous video frame image and the local point cloud map.
  • IMU: inertial measurement unit
  • FIG3 is a schematic flow chart of a positioning method according to some embodiments. As shown in FIG3 , the present disclosure provides a positioning method, which includes S101 to S105 .
  • the server 200 determines an initial location of a terminal device.
  • the server 200 receives a first request message from the terminal device 100.
  • the first request message is used to request a local point cloud map, and the first request message includes a video frame image used to locate the initial position of the terminal device 100.
  • the server 200 determines the initial position of the terminal device 100 according to the video frame image in the received first request message.
  • When the user needs to use AR navigation, the user can open the AR navigation function of the terminal device by entering a link in a web browser installed on the terminal device 100. After receiving the user's input instruction, the terminal device 100 responds to the input instruction and issues a prompt message prompting the user to scan the surrounding environment with the terminal device 100.
  • After receiving the user's scanning operation, the terminal device 100 responds to the scanning operation and captures the surrounding environment to obtain a video frame image of the surrounding environment. After obtaining the video frame image, the terminal device 100 sends the first request message to the server 200. The server 200 receives the first request message from the terminal device 100 and determines the initial position of the terminal device 100 according to the video frame image in the received first request message.
  • the server 200 determines, based on the initial position of the terminal device 100 , from a plurality of local point cloud maps, a target local point cloud map that is closest to the initial position of the terminal device 100 .
  • a plurality of local point cloud maps are pre-stored in the memory of the server 200, and each local point cloud map corresponds to a different coverage range.
  • the server 200 can compare the distance between the initial position of the terminal device 100 and the center point of each local point cloud map, and determine the target local point cloud map whose center point is closest to the initial position of the terminal device 100 from the multiple local point cloud maps.
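  • A hedged sketch of this nearest-map selection ("point cloud prediction") is shown below; the map representation and field names are assumptions made only for illustration.

```python
# Sketch of "point cloud prediction": pick the local map whose centre point
# is closest to the terminal's initial position (all names assumed).
import numpy as np

def select_target_local_map(initial_position, local_maps):
    """local_maps: list of dicts, each with a 'center' key holding an (x, y, z) array."""
    centers = np.array([m["center"] for m in local_maps])
    distances = np.linalg.norm(centers - np.asarray(initial_position), axis=1)
    return local_maps[int(np.argmin(distances))]
```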
  • the multiple local point cloud maps are obtained by optimizing the multiple initial local point cloud maps, and the multiple initial local point cloud maps are obtained by segmenting the point cloud maps.
  • Image acquisition personnel can use a camera device to collect the multiple frames of environment images needed to build AR navigation, and after collecting them, upload them to the server 200.
  • In some embodiments, the camera device may be the terminal device 100 equipped with a panoramic camera; that is, the terminal device 100 and the panoramic camera are an integrated device.
  • the terminal device 100 can use the panoramic camera to collect images of the environment through which the AR navigation passes.
  • The image acquisition personnel can hold the terminal device 100 to collect images of the environment through which the AR navigation passes, and transmit the collected multi-frame environmental images to the server 200 in real time through the terminal device 100.
  • the camera device may be a panoramic camera, that is, the terminal device 100 and the panoramic camera are two independent physical devices.
  • the terminal device 100 and the panoramic camera are connected by wired connection or wireless connection.
  • the terminal device 100 obtains multiple frames of environmental images taken by the panoramic camera by wired connection or wireless connection, and then transmits the obtained multiple frames of environmental images to the server 200 by wired connection or wireless connection.
  • the wireless connection may be a Bluetooth connection or a wireless network connection, and the present disclosure does not limit the connection method between the terminal device 100 and the panoramic camera.
  • the server 200 can perform three-dimensional reconstruction on the multiple frames of environment images to obtain a point cloud map.
  • the server 200 processes multiple frames of environmental images using a three-dimensional reconstruction method based on a first feature extraction method to determine the pose information and first feature descriptor of each of the multiple three-dimensional points. Afterwards, the server 200 processes the multiple frames of environmental images using a second feature extraction method to determine the second feature descriptor of each of the multiple three-dimensional points. Furthermore, the server 200 constructs a point cloud map based on the pose information and second feature descriptor of each of the multiple three-dimensional points. Among them, the data volume of the second feature descriptor of each three-dimensional point is less than the data volume of the first feature descriptor of the same three-dimensional point.
  • the pose information of a three-dimensional point includes the position and posture of the three-dimensional point in a specified coordinate system.
  • the feature descriptor is a representation of an image or an image block, which simplifies the image by extracting useful information and discarding redundant information.
  • A feature descriptor converts an image of size width × height × 3 (number of channels) into a feature vector/array of length n.
  • The server 200 constructs the point cloud map from the pose information and the second feature descriptor of each of the multiple three-dimensional points, which enables fast feature extraction and matching on the feature descriptors of each three-dimensional point, thereby improving the construction speed of the point cloud map and meeting the real-time service requirements of AR navigation based on the terminal device 100.
  • After constructing the point cloud map, the server 200 stores the point cloud map in a three-dimensional point plus descriptor format.
  • the first feature extraction method is the SIFT feature extraction method
  • the second feature extraction method is the oriented FAST and rotated BRIEF (ORB) feature extraction method.
  • the first feature descriptor may be a SIFT feature descriptor or an AKAZE feature descriptor
  • the second feature descriptor may be an ORB feature descriptor or a SUPERPOINT feature descriptor.
  • the server 200 performs the first 3D reconstruction based on the SIFT features of the multi-frame environment image to obtain the pose information and SIFT feature descriptor of each of the multiple 3D points. Then, the server 200 performs the second 3D reconstruction based on the ORB features of the multi-frame image to obtain the ORB feature descriptor of each of the multiple 3D points. Furthermore, the server 200 constructs a point cloud map based on the pose information and ORB feature descriptor of each of the multiple 3D points.
  • The server 200 takes too long to perform feature extraction and feature matching on the SIFT feature descriptors of three-dimensional points, and cannot meet the real-time service requirements of AR navigation based on the terminal device 100.
  • the data volume of the ORB feature descriptor of a three-dimensional point is smaller than the data volume of the SIFT feature descriptor of the same three-dimensional point.
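  • The size difference is easy to verify from OpenCV's standard descriptor shapes: a SIFT descriptor is 128 float32 values (512 bytes) per key point, while an ORB descriptor is 32 bytes per key point. The sketch below only illustrates this comparison; the image path is a placeholder.

```python
# Illustration of the descriptor-size difference mentioned above
# (standard OpenCV descriptor shapes; the image path is a placeholder).
import cv2

img = cv2.imread("environment_frame.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
orb = cv2.ORB_create()

_, sift_des = sift.detectAndCompute(img, None)   # (n, 128) float32 -> 512 bytes per point
_, orb_des = orb.detectAndCompute(img, None)     # (m, 32)  uint8   -> 32 bytes per point

print(sift_des.dtype, sift_des.shape[1] * sift_des.itemsize, "bytes per SIFT descriptor")
print(orb_des.dtype, orb_des.shape[1] * orb_des.itemsize, "bytes per ORB descriptor")
```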
  • the server 200 constructs a point cloud map based on the pose information of each three-dimensional point in the multiple three-dimensional points and the ORB feature descriptor, which improves the feature extraction speed and feature matching speed of the feature descriptors of the three-dimensional points, and improves the construction speed of the point cloud map, so that the real-time service requirements of the AR navigation based on the terminal device 100 can be met.
  • the server 200 performs three-dimensional reconstruction based on the SFM method to obtain a point cloud map.
  • the server 200 can also perform three-dimensional reconstruction based on the SLAM technology to obtain a point cloud map.
  • the server 200 can perform three-dimensional reconstruction based on the ORB-SLAM technology to obtain a point cloud map.
  • the server 200 can segment the point cloud map to obtain multiple initial local point cloud maps.
  • the server 200 segments the point cloud map to obtain multiple initial local point cloud maps, which can be implemented as follows: after generating the point cloud map, the server 200 displays the point cloud map on a display to realize visualization of the point cloud map, so that the user can set map points of interest for the point cloud map according to the point cloud map displayed on the display of the server 200. After receiving the user's operation of setting map points of interest for the point cloud map, the server 200 responds to the operation and constructs a geographic fence of a certain size based on each map point of interest, and segments the point cloud map to obtain multiple initial local point cloud maps. Among them, the map points of interest are also the key points in the AR navigation process.
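  • The geo-fence based segmentation described above could, for example, be approximated by carving a fixed-radius neighbourhood out of the global point cloud around each user-selected map point of interest, as in the following sketch; the radius value and data layout are assumptions, not the actual segmentation rule of the disclosure.

```python
# Sketch of geo-fence based segmentation: carve an initial local map out of the
# global point cloud around each point of interest (names and radius assumed).
import numpy as np

def segment_point_cloud(points, descriptors, poi_list, fence_radius=10.0):
    """points: (N, 3); descriptors: length-N list; poi_list: list of (x, y, z)."""
    local_maps = []
    for poi in poi_list:
        dist = np.linalg.norm(points - np.asarray(poi), axis=1)
        mask = dist <= fence_radius                     # simple circular geo-fence
        local_maps.append({
            "center": np.asarray(poi),
            "points": points[mask],
            "descriptors": [d for d, keep in zip(descriptors, mask) if keep],
        })
    return local_maps
```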
  • the optimization process includes at least one of the following: removing outlier three-dimensional points in the initial local point cloud map; or converting a one-to-many relationship between three-dimensional points and feature descriptors in the initial local point cloud map into a one-to-one relationship.
  • the server 200 can remove the outlier 3D points in the initial local point cloud map.
  • the initial local point cloud map is obtained by processing multiple frames of environmental images. Different frames of environmental images are obtained by shooting the same environment under different conditions. The different conditions include different angles, different times, and different locations. Therefore, a 3D point in the initial local point cloud map can have a corresponding relationship with multiple feature descriptors. However, if a 3D point in the initial local point cloud map has a corresponding relationship with multiple feature descriptors, the feature matching time will be too long, which cannot meet the real-time service requirements of AR navigation based on terminal devices.
  • converting the one-to-many relationship between the three-dimensional points and the feature descriptors in the initial local point cloud map into a one-to-one relationship can be understood as deduplication of the feature descriptors of the initial local point cloud map.
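  • One possible way to realize this de-duplication, sketched below under assumed data layouts, is to keep a single representative (medoid) ORB descriptor per 3D point; the disclosure does not prescribe a specific reduction rule.

```python
# Sketch of descriptor de-duplication: collapse the one-to-many relation between
# a 3D point and its ORB descriptors into a single representative descriptor.
import numpy as np

def deduplicate_descriptors(descriptor_sets):
    """descriptor_sets: list where entry i is a (k_i, 32) uint8 array for 3D point i."""
    representatives = []
    for des in descriptor_sets:
        # Keep the descriptor closest (in Hamming distance) to all the others,
        # i.e. the medoid of the set.
        bits = np.unpackbits(des, axis=1).astype(np.int32)            # (k_i, 256)
        pairwise = np.abs(bits[:, None, :] - bits[None, :, :]).sum(axis=2)
        representatives.append(des[int(np.argmin(pairwise.sum(axis=1)))])
    return np.stack(representatives)                                   # (N, 32) uint8
```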
  • the server 200 sends the target local point cloud map to the terminal device 100.
  • the server 200 determines the target local point cloud map that is closest to the initial position of the terminal device 100 based on the initial position of the terminal device 100, it sends the target local point cloud map to the terminal device 100 so that the terminal device 100 can implement the AR navigation function based on the target local point cloud map.
  • The server 200 sends the target local point cloud map in binary file (BIN) format to the terminal device 100 as a byte stream.
  • The target local point cloud map in BIN format occupies less storage space on the terminal device 100, which saves storage space and enables the terminal device 100 to store more local point cloud maps, so that the terminal device 100 can use the stored local point cloud maps as offline maps when the mobile network is turned off, ensuring the user experience.
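  • A simple illustration of such a BIN byte-stream layout is given below; the header format and field order are assumptions, not the actual file format of the disclosure.

```python
# Sketch of packing a local point cloud map into a compact binary (BIN) byte
# stream for transfer to the terminal device; the layout here is assumed.
import struct
import numpy as np

def to_bin(points: np.ndarray, descriptors: np.ndarray) -> bytes:
    """points: (N, 3) float32; descriptors: (N, 32) uint8 (one per 3D point)."""
    header = struct.pack("<I", points.shape[0])          # little-endian point count
    return header + points.astype(np.float32).tobytes() + descriptors.tobytes()

def from_bin(buf: bytes):
    n = struct.unpack_from("<I", buf, 0)[0]
    points = np.frombuffer(buf, dtype=np.float32, count=n * 3, offset=4).reshape(n, 3)
    descriptors = np.frombuffer(buf, dtype=np.uint8, offset=4 + n * 12).reshape(n, 32)
    return points, descriptors
```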
  • the terminal device 100 obtains the target local point cloud map from the server 200.
  • the target local point cloud map is the local point cloud map that is closest to the initial position of the terminal device 100 among the multiple local point cloud maps.
  • After receiving the user's instruction to turn on the AR navigation function, the terminal device 100 sends a first request message to the server 200 in response to the instruction.
  • the first request message is used to request a target local point cloud map, and the first request message includes a video frame image for locating the initial position.
  • the terminal device 100 receives a first response message from the server 200, and the first response message includes a target local point cloud map.
  • After the terminal device 100 receives the target local point cloud map in BIN format from the server 200, the terminal device 100 reads the target local point cloud map in BIN format, determines the geo-fence information of the target local point cloud map, sets the center point according to the geo-fence information, loads the center point into the Web browser client, and stores it in the JavaScript data structure of the Web browser client.
  • the terminal device 100 can perform real-time positioning based on the video stream of the surrounding environment captured in real time, and determine whether to move into the coverage of the target local point cloud map in combination with the coverage of the target local point cloud map.
  • Scenario 1: The server 200 determines whether the terminal device 100 has moved into the coverage of the target local point cloud map.
  • the terminal device 100 can intercept in real time based on WebRTC technology the video stream of the user's surrounding environment shot by the terminal device 100 to obtain a real-time video frame image of the user's surrounding environment, and then send the real-time video frame image of the user's surrounding environment to the server 200, which locates the video frame image of the user's surrounding environment to obtain the current position of the terminal device 100.
  • The server 200 calculates the distance between the current position of the terminal device 100 and the center point of the target local point cloud map in real time. If it detects that this distance is less than or equal to the inner radius of the target local point cloud map, it determines that the terminal device 100 has moved into the coverage range of the target local point cloud map. After this determination, the terminal device 100 receives a second response message from the server 200, and the second response message is used to indicate that the terminal device 100 has moved into the coverage range of the target local point cloud map.
  • Scenario 2: The terminal device 100 determines whether it has moved into the coverage of the target local point cloud map.
  • The terminal device 100 can intercept, in real time and based on WebRTC technology, the video stream of the user's surrounding environment shot by the terminal device 100 to obtain a real-time video frame image of the surrounding environment, locate this video frame image to obtain the current position of the terminal device 100, and then calculate in real time the distance between the current position of the terminal device 100 and the center point of the target local point cloud map. If it detects that this distance is less than or equal to the inner radius of the target local point cloud map, it determines that the terminal device 100 has moved into the coverage range of the target local point cloud map.
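  • In both scenarios the coverage test itself reduces to a distance comparison against the inner radius of the target local point cloud map, as in this small sketch (names assumed).

```python
# Sketch of the coverage check used in both scenarios above: the terminal is
# inside the target local map once its position is within the inner radius.
import numpy as np

def inside_coverage(current_position, map_center, inner_radius) -> bool:
    return np.linalg.norm(np.asarray(current_position) - np.asarray(map_center)) <= inner_radius
```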
  • the terminal device 100 can determine the posture information corresponding to the current video frame image based on the current video frame image and the target local point cloud map.
  • the terminal device 100 determines the posture information corresponding to the current video frame image based on the current video frame image and the target local point cloud map, which can be implemented as X1 to X3.
  • the terminal device 100 may perform ORB feature extraction on the current video frame image to obtain an ORB feature descriptor of the current video frame image.
  • the terminal device 100 may use JsFeat technology to accelerate when performing ORB feature extraction on the current video frame image.
  • the terminal device 100 can perform feature matching on the feature descriptors of the current video frame image and the feature descriptors of the three-dimensional points in the local point cloud map based on the WebAssembly component to determine the matched three-dimensional points.
  • After the terminal device 100 determines the matched three-dimensional points, it can solve the pose of the matched three-dimensional points to obtain the pose information corresponding to the current video frame image.
  • The pose information corresponding to the current video frame image is the pose of the terminal device 100 relative to the real world (world coordinate system) when the current video frame image is shot, and it includes the position and posture of the terminal device 100 in the world coordinate system.
  • the terminal device 100 can also perform posture solution on the posture information of the matched three-dimensional points based on the WebAssembly component to obtain the posture information corresponding to the current video frame image.
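  • Steps X1 to X3 correspond closely to a standard feature-based relocalization pipeline. The following Python sketch uses OpenCV equivalents (ORB extraction, Hamming-distance matching, PnP pose solving); the camera intrinsics K and the map data layout are assumptions, and the implementation in the disclosure runs in the browser via JsFeat/WebAssembly rather than OpenCV.

```python
# Sketch of steps X1-X3 on the terminal side: ORB extraction, descriptor matching
# against the local map, and pose solving via PnP (OpenCV names; K is assumed).
import cv2
import numpy as np

def localize(frame_gray, map_points_3d, map_descriptors, K, dist_coeffs=None):
    """map_points_3d: (N, 3) float32; map_descriptors: (N, 32) uint8 ORB (one per point)."""
    orb = cv2.ORB_create()
    keypoints, frame_des = orb.detectAndCompute(frame_gray, None)      # X1: feature extraction

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(frame_des, map_descriptors)                # X2: feature matching
    matches = sorted(matches, key=lambda m: m.distance)[:100]

    img_pts = np.float32([keypoints[m.queryIdx].pt for m in matches])
    obj_pts = np.float32([map_points_3d[m.trainIdx] for m in matches])

    # X3: solve the camera pose (rotation + translation) in the map frame.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(obj_pts, img_pts, K, dist_coeffs)
    R, _ = cv2.Rodrigues(rvec)
    return ok, R, tvec
```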
  • the terminal device 100 completes the positioning of the terminal device 100 according to the target local point cloud map and the current video frame image obtained from the server 200.
  • the point cloud map is larger than the local point cloud map, and has higher requirements for computing power, and is not suitable for the terminal device 100 with limited computing power.
  • the local point cloud map is smaller than the point cloud map, and has lower requirements for computing power.
  • the terminal device 100 can complete the positioning of the terminal device 100 according to the target local point cloud map and the current video frame image, and realizes the positioning of the terminal device 100 with higher precision using the limited computing power of the terminal device 100.
  • the terminal device 100 uses the local point cloud map and the current video frame image to locate the terminal device 100, and is less affected by indoor buildings, so that the AR navigation based on the terminal device 100 can be applied to indoor AR navigation, which improves the scope of application of the AR navigation based on the terminal device 100.
  • a positioning method provided by an embodiment of the present disclosure can, after determining the posture information corresponding to the current video frame image, generate a navigation trajectory based on the posture information of the video frame images at multiple moments.
  • the navigation trajectory generation process can include A1 to A3.
  • the terminal device 100 obtains a posture sequence.
  • the pose sequence includes pose information corresponding to video frame images at multiple moments.
  • When the terminal device 100 turns on the AR navigation function, the terminal device 100 can intercept the video stream of the user's surrounding environment shot by the terminal device 100 over a period of time based on WebRTC technology, obtain video frame images of the surrounding environment at multiple moments, and then perform the processing of S105 on the video frame images at multiple moments to obtain the pose information of the video frame images at multiple moments, that is, to obtain a pose sequence.
  • the terminal device 100 performs interpolation processing and filtering processing on the posture sequence to obtain a posture sequence after interpolation processing and filtering processing.
  • the pose sequence can be interpolated and filtered so that the pose sequence after interpolation and filtering has a good degree of smoothness.
  • the terminal device 100 displays the navigation trajectory based on the posture sequence after interpolation and filtering.
  • After the terminal device 100 obtains the pose sequence after interpolation and filtering, the terminal device 100 generates a navigation trajectory based on that pose sequence and the target local point cloud map, and displays the navigation trajectory.
  • the terminal device 100 displays the navigation trajectory based on the posture sequence after interpolation and filtering, which can be implemented as follows: the terminal device 100 displays the navigation trajectory through a web browser client based on the posture sequence after interpolation and filtering.
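  • A minimal sketch of the interpolation and filtering in A2 is shown below: positions from the pose sequence are resampled to a fixed rate and smoothed with a moving-average filter. The target rate and window size are assumptions; the disclosure does not specify the exact interpolation or filter.

```python
# Sketch of A2: densify the pose sequence by linear interpolation of positions
# and smooth it with a small moving-average filter (rate and window assumed).
import numpy as np

def smooth_trajectory(timestamps, positions, target_hz=30.0, window=5):
    """timestamps: (M,) seconds, increasing; positions: (M, 3) translations from the pose sequence."""
    t_new = np.arange(timestamps[0], timestamps[-1], 1.0 / target_hz)
    dense = np.stack([np.interp(t_new, timestamps, positions[:, i]) for i in range(3)], axis=1)

    kernel = np.ones(window) / window
    smoothed = np.stack([np.convolve(dense[:, i], kernel, mode="same") for i in range(3)], axis=1)
    return t_new, smoothed
```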
  • While the terminal device 100 is displaying the navigation trajectory, it updates its current position in real time based on the current video frame image in the video stream of the user's surrounding environment captured by the terminal device 100, and updates the navigation trajectory in real time in combination with the local point cloud map and the previous video frame image before the current video frame image, to achieve a tracking effect.
  • After obtaining the pose sequence, the terminal device 100 performs interpolation and filtering on it, so that the navigation trajectory obtained from the processed pose sequence is smoother.
  • the terminal device 100 updates the navigation trajectory in real time according to the previous video frame image and the local point cloud map, thereby achieving a tracking effect for the user, which helps to improve the user experience.
  • the positioning device includes a hardware structure and/or software module corresponding to the execution of each function.
  • the present disclosure can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is executed in the form of hardware or computer software driving hardware depends on the specific application and design constraints of the technical solution. Professional and technical personnel can use different methods to implement the described functions for each specific application, but such implementation should not be considered to exceed the scope of the present disclosure.
  • FIG4 is a schematic diagram of the composition of a positioning device according to some embodiments.
  • the present disclosure provides a positioning device 2000.
  • the positioning device 2000 may be the terminal device 100, or a functional module in the terminal device 100, or any electronic device connected to the terminal device 100, etc., which is not limited by the present disclosure.
  • the positioning device 2000 includes a communication unit 2001 and a processing unit 2002. In some embodiments, the positioning device 2000 may also include a storage unit 2003.
  • the communication unit 2001 is used to obtain a target local point cloud map from the server 200 , where the target local point cloud map is a local point cloud map that is closest to an initial position of the terminal device 100 among multiple local point cloud maps.
  • the processing unit 2002 is used to determine the posture information corresponding to the current video frame image based on the current video frame image and the target local point cloud map after moving into the coverage area of the target local point cloud map.
  • each local point cloud map is obtained by optimizing an initial local point cloud map; the optimization process includes at least one of the following: removing outlier three-dimensional points in the initial local point cloud map; or, converting a one-to-many relationship between three-dimensional points and feature descriptors in the initial local point cloud map into a one-to-one relationship.
  • The initial local point cloud map is segmented from the point cloud map, and the point cloud map is constructed by: processing multiple frames of environmental images with a three-dimensional reconstruction method based on a first feature extraction method to determine the pose information and the first feature descriptor of each of multiple three-dimensional points; processing the multiple frames of environmental images with a second feature extraction method to determine the second feature descriptor of each of the multiple three-dimensional points; and constructing the point cloud map based on the pose information and the second feature descriptor of each of the multiple three-dimensional points.
  • the communication unit is further used to: send a first request message to the server 200, the first request message is used to request a target local point cloud map, and the first request message includes a video frame image for locating an initial position.
  • the communication unit is used to receive a first response message from the server 200, where the first response message includes a target local point cloud map.
  • the storage unit 2003 is used to store a local point cloud map.
  • the communication unit 2001 is used to receive the target local point cloud map in a binary file format sent by the server 200.
  • the processing unit is used to: extract features from the current video frame image to obtain a feature descriptor of the current video frame image; perform feature matching between the feature descriptor of the current video frame image and the feature descriptor of the three-dimensional points in the local point cloud map to determine the matched three-dimensional points; and obtain the pose information corresponding to the current video frame image based on the pose information of the matched three-dimensional points.
  • the communication unit 2001 is further used to obtain a posture sequence, where the posture sequence includes posture information corresponding to video frame images at multiple moments.
  • the processing unit 2002 is further used to: perform interpolation processing and filtering processing on the posture sequence to obtain the posture sequence after interpolation processing and filtering processing; and display the navigation trajectory based on the posture sequence after interpolation processing and filtering processing.
  • The processing unit 2002 is used to display the navigation trajectory through a Web browser client based on the pose sequence after interpolation processing and filtering processing.
  • FIG5 is a schematic diagram of another positioning device according to some embodiments.
  • the present disclosure provides a positioning device 3000.
  • the positioning device 3000 may be the server 200, or a functional module in the server 200, or any electronic device connected to the server 200, etc., which is not limited by the present disclosure.
  • the positioning device 3000 includes a communication unit 3001 and a processing unit 3002. In some embodiments, the positioning device 3000 may also include a storage unit 3003.
  • the processing unit 3002 is used to determine an initial position of the terminal device 100 .
  • the processing unit 3002 is further used to determine, based on the initial position of the terminal device 100 , from a plurality of local point cloud maps, a target local point cloud map that is closest to the initial position of the terminal device 100 .
  • the communication unit 3001 is used to send the target local point cloud map to the terminal device 100.
  • the processing unit 3002 is also used to: segment the point cloud map to obtain multiple initial local point cloud maps; optimize the multiple initial local point cloud maps respectively to obtain multiple local point cloud maps; wherein the optimization processing includes at least one of the following: removing outlier three-dimensional points in the initial local point cloud map; or, converting a one-to-many relationship between the three-dimensional points in the initial local point cloud map and the feature descriptors into a one-to-one relationship.
  • The processing unit 3002 is further used to: process the multiple frames of environmental images using a 3D reconstruction method based on a first feature extraction method to determine the pose information and the first feature descriptor of each of the multiple 3D points; and process the multiple frames of environmental images using a second feature extraction method to determine the second feature descriptor of each of the multiple 3D points; wherein the data volume of the second feature descriptor of a 3D point is less than the data volume of the first feature descriptor of the same 3D point.
  • a point cloud map is constructed based on the pose information and the second feature descriptor of each 3D point in the multiple 3D points.
  • the first feature descriptor is a scale-invariant feature transform (SIFT) feature descriptor
  • the second feature descriptor is an ORB feature descriptor
  • the communication unit 3001 is used to receive first request information from the terminal device 100 , where the first request information is used to request a local point cloud map, and the first request information includes a video frame image for locating an initial position of the terminal device 100 .
  • the processing unit 3002 is used to determine the initial position of the terminal device 100 according to the video frame image used to locate the initial position of the terminal device 100.
  • the communication unit 3001 is used to send a target local point cloud map in a binary file format to the terminal device 100 .
  • the above method is applicable to a server running a Web browser server, and there is a communication connection between the Web browser server and a Web browser client running on the terminal device 100.
  • the storage unit 3003 is used to store the point cloud map.
  • the units in FIG. 4 and FIG. 5 may also be referred to as modules.
  • a processing unit may be referred to as a processing module.
  • If the units in FIG. 4 and FIG. 5 are implemented in the form of software function modules and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • In essence, the technical solution of the embodiments of the present disclosure, or the part that contributes over the related technology, or all or part of the technical solution, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium, including several instructions to enable a computer device (which can be a personal computer, server, or network device, etc.) or a processor (processor) to perform all or part of the steps of the method described in each embodiment of the present disclosure.
  • The storage medium for storing computer software products includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, etc.
  • FIG6 is a schematic diagram of the structure of a terminal device according to some embodiments.
  • the present disclosure provides a terminal device, terminal device 4000, which may be the above-mentioned positioning device 2000.
  • the terminal device 4000 includes: a processor 4002, a communication interface 4003, and a bus 4004.
  • the terminal device 4000 may also include a memory 4001.
  • FIG7 is a schematic diagram of the structure of a server according to some embodiments.
  • the server 5000 may be the above-mentioned positioning device 3000.
  • the server 5000 includes: a processor 5002, a communication interface 5003, and a bus 5004.
  • the server 5000 may also include a memory 5001.
  • processor 4002 and processor 5002 can both implement or execute various exemplary logic blocks, modules and circuits described in conjunction with the present disclosure.
  • Processor 4002 and processor 5002 can both be central processing units, general-purpose processors, digital signal processors, application-specific integrated circuits, field programmable gate arrays or other programmable logic devices, transistor logic devices, hardware components or any combination thereof.
  • the processor 5002 can also be a combination that implements computing functions, such as a combination of one or more microprocessors, a combination of DSPs and microprocessors, etc.
  • Communication interface 4003 and communication interface 5003 are both used to connect to other devices through a communication network.
  • The communication network may be Ethernet, a wireless access network, a wireless local area network (WLAN), etc.
  • Memory 4001 and memory 5001 may be read-only memory (ROM) or other types of static storage devices that can store static information and instructions, random access memory (RAM) or other types of dynamic storage devices that can store information and instructions, or electrically erasable programmable read-only memory (EEPROM), disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, but are not limited to these.
  • ROM: read-only memory
  • RAM: random access memory
  • EEPROM: electrically erasable programmable read-only memory
  • the memory 4001 may exist independently of the processor 4002, and the memory 4001 may be connected to the processor 4002 via a bus 4004 for storing instructions or program codes.
  • When the processor 4002 calls and executes the instructions or program codes stored in the memory 4001, the positioning method provided in the embodiments of the present disclosure can be implemented.
  • the memory 5001 may exist independently of the processor 5002, and the memory 5001 may be connected to the processor 5002 via a bus 5004 for storing instructions or program codes.
  • the processor 5002 calls and executes the instructions or program codes stored in the memory 5001, the positioning method provided in the embodiment of the present disclosure can be implemented.
  • the memory 4001 may be integrated with the processor 4002
  • the memory 5001 may be integrated with the processor 5002 .
  • Bus 4004 and bus 5004 may be extended industry standard architecture (EISA) buses, etc.
  • Bus 4004 and bus 5004 may be divided into address buses, data buses, control buses, etc.
  • EISA: extended industry standard architecture
  • Only one thick line is used in both FIG6 and FIG7 , but this does not mean that there is only one bus or one type of bus.
  • the disclosed embodiment also provides a computer-readable storage medium. All or part of the processes in the above method embodiments can be completed by computer instructions to instruct the relevant hardware, and the program can be stored in the above computer-readable storage medium. When the program is executed, it can include the processes of the above method embodiments.
  • The computer-readable storage medium can be the memory in any of the above embodiments.
  • the above computer-readable storage medium can also be an external storage device of the above positioning device, such as a plug-in hard disk, a smart memory card (smart media card, SMC), a secure digital (secure digital, SD) card, a flash card (flash card), etc. equipped on the above positioning device.
  • the above computer-readable storage medium can also include both the internal storage unit of the above positioning device and an external storage device.
  • the above computer-readable storage medium is used to store the above computer program and other programs and data required by the above positioning device.
  • the above computer-readable storage medium can also be used to temporarily store data that has been output or is to be output.
  • the embodiments of the present disclosure also provide a computer program product, which includes a computer program.
  • When the computer program product is run on a computer, the computer is enabled to execute any one of the positioning methods provided in the above embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Remote Sensing (AREA)
  • Computer Graphics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

Provided is a positioning method. The method comprises: obtaining a target local point cloud map from a server, wherein the target local point cloud map is a local point cloud map closest to an initial position of a terminal device in a plurality of local point cloud maps; and after moving into the coverage of the target local point cloud map, on the basis of a current video frame image and the target local point cloud map, determining pose information corresponding to the current video frame image.

Description

定位方法、终端设备、服务器及存储介质Positioning method, terminal device, server and storage medium
本公开要求申请号为202211429870.1、2022年11月15日提交的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This disclosure claims priority to Chinese patent application No. 202211429870.1, filed on November 15, 2022, the entire contents of which are incorporated by reference into this application.
技术领域Technical Field
本公开涉及通信技术领域,尤其涉及一种定位方法、终端设备、服务器及存储介质。The present disclosure relates to the field of communication technology, and in particular to a positioning method, terminal equipment, server and storage medium.
背景技术Background technique
近年来,随着计算机技术飞速发展,运行在终端设备上的增强现实(augmented reality,AR)技术应运而生,AR技术具有跨平台、普适性强、易传播的优势。In recent years, with the rapid development of computer technology, augmented reality (AR) technology running on terminal devices has emerged. AR technology has the advantages of cross-platform, strong universality and easy dissemination.
发明内容Summary of the invention
In a first aspect, a positioning method is provided. The method includes:
obtaining a target local point cloud map from a server, the target local point cloud map being the local point cloud map that is closest to an initial position of a terminal device among a plurality of local point cloud maps; and
after moving into the coverage range of the target local point cloud map, determining pose information corresponding to a current video frame image based on the current video frame image and the target local point cloud map.
In a second aspect, a positioning method is provided. The method includes:
determining an initial position of a terminal device;
determining, according to the initial position of the terminal device, a target local point cloud map that is closest to the initial position of the terminal device from a plurality of local point cloud maps; and
sending the target local point cloud map to the terminal device.
In a third aspect, a positioning device is provided. The device includes:
a communication unit, configured to obtain a target local point cloud map from a server, the target local point cloud map being the local point cloud map that is closest to an initial position of a terminal device among a plurality of local point cloud maps; and
a processing unit, configured to determine, after the terminal device moves into the coverage range of the target local point cloud map, pose information corresponding to a current video frame image based on the current video frame image and the target local point cloud map.
In a fourth aspect, a positioning device is provided. The device includes:
a processing unit, configured to determine an initial position of a terminal device;
the processing unit being further configured to determine, according to the initial position of the terminal device, a target local point cloud map that is closest to the initial position of the terminal device from a plurality of local point cloud maps; and
a communication unit, configured to send the target local point cloud map to the terminal device.
In a fifth aspect, a terminal device is provided, including a processor and a memory, wherein the memory stores instructions executable by the processor, and the processor is configured to execute the instructions so that the terminal device implements the method provided in the first aspect.
In a sixth aspect, a server is provided, including a processor and a memory, wherein the memory stores instructions executable by the processor, and the processor is configured to execute the instructions so that the server implements the method provided in the second aspect.
In a seventh aspect, a computer-readable storage medium is provided, which stores computer instructions. When the computer instructions are run on a computer, the computer is caused to execute the method provided in the first aspect or the method provided in the second aspect.
In an eighth aspect, a computer program product including computer instructions is provided. When the computer instructions are run on a computer, the computer is caused to execute the method provided in the first aspect or the method provided in the second aspect.
Brief Description of the Drawings
The accompanying drawings are used to provide a further understanding of the technical solutions of the present disclosure and constitute a part of the specification. Together with the embodiments of the present disclosure, they are used to explain the technical solutions of the present disclosure and do not constitute a limitation on the technical solutions of the present disclosure.
FIG. 1 is a schematic diagram of the composition of a positioning system according to some embodiments;
FIG. 2 is a functional schematic diagram of a terminal device and a server according to some embodiments;
FIG. 3 is a schematic flowchart of a positioning method according to some embodiments;
FIG. 4 is a schematic diagram of the composition of a positioning device according to some embodiments;
FIG. 5 is a schematic diagram of the composition of another positioning device according to some embodiments;
FIG. 6 is a schematic structural diagram of a terminal device according to some embodiments;
FIG. 7 is a schematic structural diagram of a server according to some embodiments.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the embodiments of the present disclosure, the technical solutions in the embodiments of the present disclosure will be described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present disclosure, not all of them. Based on the embodiments of the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present disclosure.
It should be noted that all directional indications in the embodiments of the present disclosure (such as up, down, left, right, front and back) are only used to explain the relative positional relationship, movement and the like between components in a specific posture (as shown in the accompanying drawings). If the specific posture changes, the directional indication changes accordingly.
The terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of such features. In the description of the present disclosure, unless otherwise specified, "a plurality of" means two or more.
In the description of the present disclosure, unless otherwise specified, "/" means "or"; for example, A/B may mean A or B. The term "and/or" includes any and all combinations of one or more of the associated listed items. For example, A and/or B includes only A, only B, and both A and B.
In the description of the present disclosure, it should be noted that, unless otherwise expressly specified and limited, the terms "connected" and "connection" should be understood in a broad sense; for example, a connection may be a fixed connection, a detachable connection or an integral connection. Those of ordinary skill in the art can understand the meanings of the above terms in the present disclosure according to the specific circumstances. In addition, when a pipeline is described, "connected" and "connection" as used in the present disclosure mean that the pipeline is conducted. The specific meaning should be understood in conjunction with the context.
In the embodiments of the present disclosure, words such as "exemplarily" or "for example" are used to indicate an example, illustration or description. Any embodiment or design described as "exemplary" or "for example" in the embodiments of the present disclosure should not be construed as being more preferred or more advantageous than other embodiments or designs. Rather, the use of words such as "exemplarily" or "for example" is intended to present related concepts in a specific manner.
In order to facilitate understanding of the technical solutions of the present disclosure by those skilled in the art, the terms involved in the present disclosure are briefly introduced below.
(1) Web real-time communication
Web real-time communications (WebRTC) is a real-time communication technology. WebRTC allows a web application or site to establish a peer-to-peer connection between browsers without relying on an intermediary, so as to transmit video streams and/or audio streams or any other data. Therefore, WebRTC makes it possible for users to create peer-to-peer data sharing and teleconferencing without installing any plug-in or third-party software.
(2) Structure from motion
Structure from motion (SFM) is an algorithm for three-dimensional reconstruction. SFM extracts key points from multiple frames of two-dimensional images and performs image matching to calculate the image poses of the two-dimensional images, and then performs three-dimensional reconstruction according to the image poses and the two-dimensional coordinates of the key points to obtain the corresponding three-dimensional coordinate points.
Three-dimensional reconstruction technology refers to an image processing technology that constructs the three-dimensional structure of an object or scene in images from multiple frames of two-dimensional images. Three-dimensional reconstruction technology is usually applied in augmented reality (AR), mixed reality (MR), visual positioning, autonomous driving and the like.
(3) Scale-invariant feature transform
Scale-invariant feature transform (SIFT) is a description used in the field of image processing. This description is scale-invariant and can detect key points in an image; it is a local feature descriptor. SIFT features are based on points of interest in the local appearance of an object and are independent of the size and rotation of the image. SIFT features have a high tolerance for changes in illumination, noise and small changes in viewing angle. The detection rate for partially occluded objects using the SIFT feature description is also quite high; even three or more SIFT object features are sufficient to calculate the position and orientation of the object.
(4) AR technology
AR technology is a technology that fuses virtual information with real information. Virtual information such as computer-generated text and images is simulated and then applied to the real world, thereby "augmenting" the real world.
AR navigation based on terminal devices is an important application of AR technology. AR navigation locates the user's current position and determines a navigation path, renders the navigation path into the real environment through AR, and adds real-time navigation information to the real road conditions to guide the user forward more intuitively, so that the user obtains an immersive navigation experience. Because AR navigation is computationally intensive and the computing power of terminal devices is limited, in the related art the positioning of the terminal device is completed by a server based on the current video frame image and a point cloud map, that is, the terminal device relies on an external device for its own positioning. The existing positioning methods for AR navigation all have certain limitations, which lead either to poor positioning accuracy of the terminal device when performing AR navigation in certain scenarios or to high positioning costs, thereby limiting the scope of application of AR navigation based on terminal devices.
For example, the positioning method based on the global positioning system (GPS) has high positioning accuracy when AR navigation is performed outdoors. However, when AR navigation is performed indoors, indoor buildings affect the GPS signal strength and the positioning accuracy decreases. Therefore, the GPS-based positioning method cannot be applied to indoor AR navigation.
As another example, in the positioning method based on multi-sensor fusion, positioning is performed by fusing data from sensors such as inertial sensors, lidar and Bluetooth. Although this positioning method also has high positioning accuracy, indoor buildings likewise affect the detection of the sensors during indoor AR navigation, so that the positioning accuracy during indoor AR navigation is poor; moreover, because more sensors are added, the positioning cost is excessively high.
As yet another example, the vision-based positioning method takes a simultaneous localization and mapping (SLAM) framework as a basis and performs positioning through main steps such as feature extraction, feature matching and pose solving. The vision-based positioning method also has high positioning accuracy, but it requires high computing power, while the computing power of terminal devices is limited. Therefore, this positioning method is not applicable to terminal devices, that is, it is not applicable to AR navigation based on terminal devices.
In summary, how to achieve relatively high-precision positioning of a terminal device with the limited computing power of the terminal device is an urgent problem to be solved.
To solve the above problems, the present disclosure provides a positioning method. A target local point cloud map that is closest to the initial position of the terminal device is obtained from a server, and the terminal device can then complete its own positioning according to the target local point cloud map and the current video frame image. It can be understood that a local point cloud map is small and requires little computing power, so a terminal device with limited computing power can complete its own positioning based on the target local point cloud map and the current video frame image, thereby achieving relatively high-precision positioning of the terminal device with the limited computing power of the terminal device.
The technical solutions of the embodiments of the present disclosure will be described clearly and completely below in conjunction with the accompanying drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only some of the embodiments of the present disclosure, not all of them. Based on the embodiments of the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present disclosure.
FIG. 1 is a schematic diagram of the composition of a positioning system according to some embodiments. As shown in FIG. 1, the present disclosure provides a positioning system. The positioning system may include a terminal device 100 and a server 200, and the terminal device 100 and the server 200 may be connected via a wired network or a wireless network.
In some embodiments, the terminal device 100 may be a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an AR/virtual reality (VR) device, a laptop computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), or the like. The embodiments of the present disclosure do not limit the type of the terminal device 100.
In some embodiments, the terminal device 100 may be installed with clients of various application services, for example a browser, a client of an instant messaging tool, and the like. A user may operate the client of an application service, and in response to the user's operation the terminal device 100 interacts with the server 200 through a wired network or a wireless network to receive or send information. The browser may be a web browser that supports WebRTC and WebAssembly (WASM) components, which are portable, small in size, fast to load and highly compatible.
In some embodiments, the server 200 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.
The server 200 may cooperate with the terminal device 100 to enable the user to use various application services. For example, the server 200 may analyze and process information sent by the terminal device 100 and feed the processing result back to the terminal device 100.
In some embodiments, the server 200 is a server running a web browser server side, the terminal device 100 is a terminal device running a web browser client, and there is a communication connection between the web browser server side and the web browser client.
In some embodiments, the server 200 is a server with a display, and the display of the server is used to display a control interface of the server.
In some embodiments, the wired network or wireless network may include a router, a switch, a base station, or other devices that facilitate communication between the terminal device 100 and the server 200, which is not limited in the embodiments of the present disclosure.
It should be understood that the number of devices included in the positioning system shown in FIG. 1 is not limited; for example, the number of terminal devices 100 is not limited, and the number of servers is not limited. In addition to the devices shown in FIG. 1, the positioning system shown in FIG. 1 may also include other devices, which is not limited in the present disclosure.
FIG. 2 is a functional schematic diagram of a terminal device and a server according to some embodiments. As shown in FIG. 2, in some embodiments of the present disclosure, the terminal device 100 and the server 200 are respectively used to implement the following functions.
The server 200 is used to perform scale recovery on multiple frames of environment images after the multiple frames of environment images are acquired, and then store them in a database of the server 200, which may be a database.db database.
The server 200 is also used to perform feature extraction on the multiple frames of environment images, generate a point cloud map, and store the point cloud map in a 3D-point-plus-descriptor format.
The server 200 is also used to implement functions such as outlier removal, point cloud division, point cloud storage structure, point cloud compression, point cloud prediction and point cloud map storage for the point cloud map. Point cloud division means dividing the point cloud map into multiple local point cloud maps based on density or based on a grid. The point cloud storage structure stores the number of three-dimensional (3D) points and the number of feature descriptors, where the correspondence between 3D points and feature descriptors is one-to-many. Point cloud compression means removing ceiling and ground points from the point cloud map. Point cloud prediction means determining, according to the initial position of the terminal device 100, the local point cloud map closest to the initial position of the terminal device 100. The terminal device 100 is used to obtain the current video frame image and send the current video frame image to the server 200 through the web browser client, so as to obtain from the server 200 the local point cloud map closest to the initial position of the terminal device 100, and to store the obtained local point cloud map in a database.
The server 200 is a server running the web browser server side, so the server 200 is also used for server-side positioning, that is, the web browser server side completes the positioning of the current position of the terminal device 100 according to the current video frame image sent by the terminal device 100, and stores the initial position of the terminal device 100 in the database.db database.
The terminal device 100 is also used to track the real-time position of the user during AR navigation according to the three-axis attitude angles (or angular rates) and accelerations measured by an inertial measurement unit (IMU), in combination with the current video frame image, the previous video frame image and the local point cloud map.
FIG. 3 is a schematic flowchart of a positioning method according to some embodiments. As shown in FIG. 3, the present disclosure provides a positioning method, which includes S101 to S105.
S101. The server 200 determines an initial position of the terminal device.
In some embodiments, the server 200 receives first request information from the terminal device 100. The first request information is used to request a local point cloud map and includes a video frame image used for locating the initial position of the terminal device 100. The server 200 determines the initial position of the terminal device 100 according to the video frame image in the received first request information.
Exemplarily, when the user needs to use AR navigation, the user may enter a link in a web browser installed on the terminal device 100 to enable the AR navigation function of the terminal device. After receiving the user's input instruction, the terminal device 100 responds to the input instruction and issues prompt information for prompting the user to use the terminal device 100 to scan the user's surrounding environment.
After receiving the user's scanning operation, the terminal device 100 responds to the scanning operation and captures the user's surrounding environment to obtain a video frame image of the user's surrounding environment. After obtaining the video frame image of the user's surrounding environment, the terminal device 100 sends the first request information to the server 200. The server 200 receives the first request information from the terminal device 100 and determines the initial position of the terminal device 100 according to the video frame image in the received first request information.
S102. The server 200 determines, according to the initial position of the terminal device 100, a target local point cloud map that is closest to the initial position of the terminal device 100 from a plurality of local point cloud maps.
In some embodiments, a plurality of local point cloud maps are pre-stored in the memory of the server 200, and each local point cloud map corresponds to a different coverage range. After determining the initial position of the terminal device 100, the server 200 may compare the distances between the initial position of the terminal device 100 and the center points of the respective local point cloud maps, and determine, from the plurality of local point cloud maps, the target local point cloud map whose center point is closest to the initial position of the terminal device 100.
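Exemplarily, the nearest-map selection in S102 may be sketched as follows. This is a minimal illustration only: the map structure with a precomputed "center" field and the use of a plain Euclidean distance are assumptions of the example and are not details specified by the present disclosure.

```python
import math

def select_target_local_map(initial_position, local_maps):
    """Return the local point cloud map whose center point is closest to the
    terminal device's initial position. Each map is assumed (hypothetically)
    to be a dict with a 'center' coordinate tuple and its point payload."""
    def distance(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return min(local_maps, key=lambda m: distance(initial_position, m["center"]))
```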
In some embodiments, the plurality of local point cloud maps are obtained by optimizing a plurality of initial local point cloud maps, and the plurality of initial local point cloud maps are obtained by segmenting a point cloud map.
The construction process of the point cloud map and the plurality of local point cloud maps is shown in S1 to S4.
S1. Image acquisition.
An image acquisition person may use a camera device to acquire multiple frames of environment images of the area where AR navigation is to be built, and after acquiring these multiple frames of environment images, upload them to the server 200.
In some embodiments, the camera device may be the terminal device 100, and the terminal device 100 is equipped with (or carries, or is configured with) a panoramic camera, that is, the terminal device 100 and the panoramic camera are an integrated device. The terminal device 100 can use the panoramic camera to acquire images of the environment through which the AR navigation passes. The image acquisition person may hold the terminal device 100 to acquire images of the environment through which the AR navigation passes, and transmit the acquired multiple frames of environment images to the server 200 in real time through the terminal device 100.
In other embodiments, the camera device may be a panoramic camera, that is, the terminal device 100 and the panoramic camera are two independent physical devices. The terminal device 100 and the panoramic camera are connected by a wired connection or a wireless connection. The terminal device 100 obtains the multiple frames of environment images captured by the panoramic camera through the wired or wireless connection, and then transmits the obtained multiple frames of environment images to the server 200 through a wired or wireless connection. The wireless connection may be a Bluetooth connection or a wireless network connection; the present disclosure does not limit the connection manner between the terminal device 100 and the panoramic camera.
S2. Point cloud map construction.
In some embodiments, after obtaining the multiple frames of environment images of the area where AR navigation is to be built, the server 200 may perform three-dimensional reconstruction on the multiple frames of environment images to obtain a point cloud map.
In some embodiments, the server 200 processes the multiple frames of environment images with a three-dimensional reconstruction method based on a first feature extraction manner, and determines pose information and a first feature descriptor of each of a plurality of three-dimensional points. Then, the server 200 processes the multiple frames of environment images with a second feature extraction manner, and determines a second feature descriptor of each of the plurality of three-dimensional points. Furthermore, the server 200 constructs the point cloud map based on the pose information and the second feature descriptor of each of the plurality of three-dimensional points. The data amount of the second feature descriptor of each three-dimensional point is smaller than the data amount of the first feature descriptor of the same three-dimensional point. The pose information of a three-dimensional point includes the position and orientation of the three-dimensional point in a specified coordinate system.
It should be noted that a feature descriptor is a representation of an image or an image block, which simplifies the image by extracting useful information and discarding redundant information. Typically, a feature descriptor converts an image of size width × height × 3 (number of channels) into a feature vector/array of length n.
It can be understood that, since the data amount of the second feature descriptor of each three-dimensional point is smaller than that of the first feature descriptor of the same three-dimensional point, the server 200 constructs the point cloud map with the pose information and the second feature descriptors of the plurality of three-dimensional points, so that feature extraction and feature matching on the feature descriptors of the three-dimensional points can be performed quickly. This improves the construction speed of the point cloud map, so that the real-time service requirements of AR navigation based on the terminal device 100 can be met.
In some embodiments, after constructing the point cloud map, the server 200 stores the point cloud map in a three-dimensional-point-plus-descriptor format.
In some embodiments, the first feature extraction manner is SIFT feature extraction, and the second feature extraction manner is oriented FAST and rotated BRIEF (ORB) feature extraction.
In some embodiments, the first feature descriptor may be a SIFT feature descriptor or an AKAZE feature descriptor, and the second feature descriptor may be an ORB feature descriptor or a SUPERPOINT feature descriptor.
Exemplarily, taking the first feature descriptor being a SIFT feature descriptor and the second feature descriptor being an ORB feature descriptor as an example, the server 200 performs a first three-dimensional reconstruction based on the SIFT features of the multiple frames of environment images to obtain the pose information and the SIFT feature descriptor of each of the plurality of three-dimensional points. Then, the server 200 performs a second three-dimensional reconstruction based on the ORB features of the multiple frames of images to obtain the ORB feature descriptor of each of the plurality of three-dimensional points. Furthermore, the server 200 constructs the point cloud map based on the pose information and the ORB feature descriptor of each of the plurality of three-dimensional points.
It can be understood that, although the SIFT feature descriptors of the three-dimensional points have strong descriptive ability and high accuracy, feature extraction and feature matching on SIFT feature descriptors take too long on the server 200 to meet the real-time service requirements of AR navigation based on the terminal device 100, whereas the data amount of the ORB feature descriptor of a three-dimensional point is smaller than that of the SIFT feature descriptor of the same three-dimensional point. Therefore, the server 200 constructs the point cloud map based on the pose information and the ORB feature descriptors of the plurality of three-dimensional points, which increases the feature extraction speed and feature matching speed for the feature descriptors of the three-dimensional points and improves the construction speed of the point cloud map, so that the real-time service requirements of AR navigation based on the terminal device 100 can be met.
The above embodiments describe the server 200 obtaining the point cloud map by performing three-dimensional reconstruction based on SFM. Exemplarily, the server 200 may also obtain the point cloud map by performing three-dimensional reconstruction based on SLAM technology. For example, the server 200 may perform three-dimensional reconstruction based on ORB-SLAM technology to obtain the point cloud map.
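As an illustrative sketch of the two-pass feature extraction described in S2 (SIFT keypoints driving the reconstruction, compact ORB descriptors stored in the map), the snippet below computes ORB descriptors at the locations of SIFT keypoints. Re-describing the SIFT keypoints in place, and the OpenCV calls themselves, are assumptions of this illustration; the disclosure only states that a second, ORB-based pass determines the compact descriptors of the reconstructed three-dimensional points.

```python
import cv2

def keypoints_with_compact_descriptors(image_gray):
    """Two-pass feature extraction sketch: SIFT keypoints feed the (omitted)
    SFM reconstruction, while lightweight ORB descriptors are computed at the
    same locations so that each resulting 3D point can be stored with a
    compact binary descriptor instead of a 128-dimensional SIFT vector."""
    sift = cv2.SIFT_create()
    orb = cv2.ORB_create()
    keypoints = sift.detect(image_gray, None)                  # first pass: SIFT keypoints
    keypoints, orb_desc = orb.compute(image_gray, keypoints)   # second pass: ORB descriptors
    return keypoints, orb_desc                                 # 32-byte descriptor per keypoint
```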
S3. Segment the point cloud map to obtain a plurality of initial local point cloud maps.
In some embodiments, after the point cloud map is constructed, the server 200 may segment the point cloud map to obtain a plurality of initial local point cloud maps.
In some embodiments, the segmentation of the point cloud map by the server 200 into a plurality of initial local point cloud maps may be implemented as follows: after generating the point cloud map, the server 200 displays the point cloud map on a display to visualize the point cloud map, so that the user can set map points of interest for the point cloud map according to the point cloud map displayed on the display of the server 200. After receiving the user's operation of setting map points of interest for the point cloud map, the server 200 responds to the operation and, based on each map point of interest, constructs a geofence of a certain size and segments the point cloud map to obtain a plurality of initial local point cloud maps. The map points of interest are the key points in the AR navigation process.
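Exemplarily, the geofence-based segmentation described above may be sketched as follows. Circular fences of a fixed radius and a horizontal-plane distance are assumptions of this illustration; the disclosure does not limit the size or shape of the geofence.

```python
import numpy as np

def split_by_geofence(points_xyz, poi_centers, radius):
    """Split a global point cloud into initial local maps, one per map point of
    interest (POI): every 3D point whose horizontal distance to a POI is within
    `radius` is assigned to that POI's local map. Overlap between fences is allowed."""
    local_maps = []
    for center in poi_centers:
        d = np.linalg.norm(points_xyz[:, :2] - np.asarray(center)[:2], axis=1)
        local_maps.append(points_xyz[d <= radius])
    return local_maps
```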
S4. Optimize the plurality of initial local point cloud maps to obtain a plurality of local point cloud maps.
The optimization includes at least one of the following: removing outlier three-dimensional points from the initial local point cloud map; or converting the one-to-many relationship between three-dimensional points and feature descriptors in the initial local point cloud map into a one-to-one relationship.
In some embodiments, the server 200 may use a statistical filtering method to remove outlier three-dimensional points from the initial local point cloud map.
It can be understood that the presence of outlier three-dimensional points in the initial local point cloud map affects the positioning accuracy. In order to avoid erroneous positioning caused by outlier three-dimensional points in the initial local point cloud map, the server 200 may remove the outlier three-dimensional points from the initial local point cloud map.
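Exemplarily, the statistical filtering mentioned above may be illustrated with the following sketch, which uses Open3D's statistical outlier removal purely as one possible implementation; the disclosure only specifies that a statistical filtering method is used, and the parameter values here are assumptions.

```python
import open3d as o3d

def remove_outliers(points_xyz, nb_neighbors=20, std_ratio=2.0):
    """Statistical outlier removal: points whose mean distance to their
    neighbors deviates too far from the global average are discarded."""
    pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(points_xyz))
    _filtered, kept_indices = pcd.remove_statistical_outlier(nb_neighbors, std_ratio)
    return kept_indices  # indices of the 3D points that survive filtering
```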
The initial local point cloud map is obtained by processing multiple frames of environment images, and environment images of different frames are obtained by photographing the same environment under different conditions, including different angles, different times and different positions. Therefore, one three-dimensional point in the initial local point cloud map may correspond to multiple feature descriptors. However, when one three-dimensional point in the initial local point cloud map corresponds to multiple feature descriptors, feature matching takes too long to meet the real-time service requirements of AR navigation based on terminal devices.
On this basis, the server 200 may convert the one-to-many relationship between three-dimensional points and feature descriptors in the initial local point cloud map into a one-to-one relationship, so as to increase the feature matching speed and meet the real-time service requirements of web-based AR navigation.
It should be noted that converting the one-to-many relationship between three-dimensional points and feature descriptors in the initial local point cloud map into a one-to-one relationship can be understood as performing descriptor deduplication on the feature descriptors of the initial local point cloud map.
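Exemplarily, the descriptor deduplication may be sketched as follows: for each three-dimensional point, a single representative binary descriptor is kept. Choosing the medoid under the Hamming distance is an assumption of this illustration; any representative-selection rule satisfies the one-to-one relationship described above.

```python
import numpy as np

def deduplicate_descriptors(descriptors_per_point):
    """Convert the one-to-many relation between a 3D point and its binary
    descriptors into a one-to-one relation by keeping, for each point, the
    descriptor closest in Hamming distance to all the others (the medoid)."""
    deduped = {}
    for point_id, descs in descriptors_per_point.items():
        descs = np.asarray(descs, dtype=np.uint8)
        bits = np.unpackbits(descs, axis=1).astype(np.int32)
        # pairwise Hamming distances between all descriptors of this point
        dists = (bits[:, None, :] != bits[None, :, :]).sum(axis=2)
        deduped[point_id] = descs[dists.sum(axis=1).argmin()]
    return deduped
```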
S103. The server 200 sends the target local point cloud map to the terminal device 100.
In some embodiments, after determining, according to the initial position of the terminal device 100, the target local point cloud map closest to the initial position of the terminal device 100, the server 200 sends the target local point cloud map to the terminal device 100, so that the terminal device 100 can implement the AR navigation function according to the target local point cloud map.
Exemplarily, the server 200 sends the target local point cloud map in binary file (BIN) format to the terminal device 100 through a byte stream. In this way, on the one hand, the data transmission speed can be increased while data communication is ensured, which helps meet the real-time requirements of web-based AR navigation; on the other hand, the target local point cloud map in BIN format occupies less storage space on the terminal device 100, which saves the storage space of the terminal device 100 and enables the terminal device 100 to store more local point cloud maps, so that when the mobile network is turned off, the terminal device 100 can use the stored local point cloud maps as offline maps, ensuring the user experience.
In some embodiments, after obtaining the plurality of local point cloud maps, the server 200 may convert each local point cloud map into BIN format for storage, so that when the server 200 subsequently receives the first request information sent by the terminal device 100 for requesting a local point cloud map, the server 200 does not need to convert the local point cloud map into BIN format in real time and can immediately send the local point cloud map in BIN format to the terminal device 100 through a byte stream. This improves the response speed of the server 200 to the request information of the terminal device 100 and helps meet the real-time requirements of web-based AR navigation.
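Exemplarily, the BIN-format storage may be sketched as a simple binary layout such as the one below. The exact layout (a small header followed by float32 coordinates and uint8 descriptors) is an assumption of this illustration; the disclosure only states that the local point cloud maps are stored and transmitted as binary files.

```python
import struct
import numpy as np

def pack_local_map(points_xyz, descriptors):
    """Serialize a local point cloud map into a compact binary buffer:
    a header with the point count and descriptor width, followed by
    float32 coordinates and uint8 descriptors (one descriptor per point)."""
    pts = np.asarray(points_xyz, dtype=np.float32)
    desc = np.asarray(descriptors, dtype=np.uint8)
    header = struct.pack("<II", pts.shape[0], desc.shape[1])
    return header + pts.tobytes() + desc.tobytes()
```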
S104. The terminal device 100 obtains the target local point cloud map from the server 200.
The target local point cloud map is the local point cloud map that is closest to the initial position of the terminal device 100 among the plurality of local point cloud maps.
In some embodiments, after receiving the user's instruction for enabling the AR navigation function, the terminal device 100 responds to the instruction and sends first request information to the server 200. The first request information is used to request the target local point cloud map and includes a video frame image used for locating the initial position. The terminal device 100 then receives first response information from the server 200, and the first response information includes the target local point cloud map.
In some embodiments, after the terminal device 100 receives the target local point cloud map in BIN format from the server 200, the terminal device 100 reads the target local point cloud map in BIN format, determines the geofence information of the target local point cloud map, sets a center point according to the geofence information of the target local point cloud map, loads the center point into the web browser client, and stores it in the JavaScript data structure of the web browser client.
S105. After moving into the coverage range of the target local point cloud map, determine pose information corresponding to the current video frame image based on the current video frame image and the target local point cloud map.
In some embodiments, after the terminal device 100 obtains the target local point cloud map, the terminal device 100 can perform positioning in real time according to the video stream of the surrounding environment captured in real time, and determine, in combination with the coverage range of the target local point cloud map, whether it has moved into the coverage range of the target local point cloud map.
Determining whether the terminal device has moved into the coverage range of the target local point cloud map may include the following cases.
Case 1: The server 200 determines whether the terminal device has moved into the coverage range of the target local point cloud map.
In some embodiments, the terminal device 100 may capture, in real time and based on WebRTC technology, the video stream of the user's surrounding environment photographed by the terminal device 100 to obtain real-time video frame images of the user's surrounding environment, and then send these real-time video frame images to the server 200. The server 200 performs positioning on the video frame images of the user's surrounding environment to obtain the current position of the terminal device 100.
Then, the server 200 calculates in real time the distance between the current position of the terminal device 100 and the center point of the target local point cloud map. If it is detected that the distance between the current position of the terminal device 100 and the center point of the target local point cloud map is less than or equal to the inner-circle radius of the local point cloud map, it is determined that the terminal device 100 has moved into the coverage range of the local point cloud map. After it is determined that the terminal device 100 has moved into the coverage range of the local point cloud map, the terminal device 100 receives second response information from the server 200, and the second response information is used to indicate that the terminal device 100 has moved into the coverage range of the target local point cloud map.
Case 2: The terminal device 100 determines whether it has moved into the coverage range of the target local point cloud map.
In some embodiments, the terminal device 100 may capture, in real time and based on WebRTC technology, the video stream of the user's surrounding environment photographed by the terminal device 100 to obtain real-time video frame images of the user's surrounding environment, then perform positioning on these real-time video frame images to obtain the current position of the terminal device 100, and then calculate in real time the distance between the current position of the terminal device 100 and the center point of the target local point cloud map. If it is detected that the distance between the current position of the terminal device 100 and the center point of the target local point cloud map is less than or equal to the inner-circle radius of the local point cloud map, it is determined that the terminal device 100 has moved into the coverage range of the local point cloud map.
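Exemplarily, the coverage test used in both Case 1 and Case 2 may be sketched as follows; treating the distance as a two-dimensional horizontal distance is an assumption of this illustration.

```python
import math

def in_coverage(current_position, map_center, inner_radius):
    """The device is considered to have moved into the target local map when
    its current position lies within the map's inner-circle radius of the
    map's center point."""
    dx = current_position[0] - map_center[0]
    dy = current_position[1] - map_center[1]
    return math.hypot(dx, dy) <= inner_radius
```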
After determining that it has moved into the coverage range of the target local point cloud map, the terminal device 100 may determine the pose information corresponding to the current video frame image based on the current video frame image and the target local point cloud map.
In some embodiments, the determination by the terminal device 100 of the pose information corresponding to the current video frame image based on the current video frame image and the target local point cloud map may be implemented as X1 to X3.
X1. Perform feature extraction on the current video frame image to obtain a feature descriptor of the current video frame image.
Exemplarily, the terminal device 100 may perform ORB feature extraction on the current video frame image to obtain an ORB feature descriptor of the current video frame image.
In some embodiments, in order to increase the speed of feature extraction on the current video frame image and meet the real-time requirements of navigation, the terminal device 100 may use the JsFeat technology for acceleration when performing ORB feature extraction on the current video frame image.
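Exemplarily, the ORB feature extraction of X1 may be illustrated as follows. In the disclosure this step runs in the web browser client (accelerated with JsFeat); the OpenCV-based snippet below only illustrates the equivalent operation, and its parameters are assumptions.

```python
import cv2

def extract_frame_descriptors(frame_bgr, n_features=1000):
    """ORB feature extraction for the current video frame (X1)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    orb = cv2.ORB_create(nfeatures=n_features)
    keypoints, descriptors = orb.detectAndCompute(gray, None)
    return keypoints, descriptors
```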
X2. Perform feature matching between the feature descriptor of the current video frame image and the feature descriptors of the three-dimensional points in the target local point cloud map, and determine the matched three-dimensional points.
In some embodiments, in order to improve the quality of the matched three-dimensional points, the terminal device 100 may perform, based on a WebAssembly component, feature matching between the feature descriptor of the current video frame image and the feature descriptors of the three-dimensional points in the local point cloud map, and determine the matched three-dimensional points.
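Exemplarily, the feature matching of X2 may be sketched as a brute-force Hamming match between the frame descriptors and the descriptors of the three-dimensional points in the target local point cloud map. Lowe's ratio test is an assumption added for illustration; in the disclosure the matching is performed in a WebAssembly component.

```python
import cv2

def match_to_local_map(frame_desc, map_desc, ratio=0.75):
    """Match the frame's ORB descriptors against the descriptors of the 3D
    points in the target local map (X2), keeping only unambiguous matches."""
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    knn = matcher.knnMatch(frame_desc, map_desc, k=2)
    return [pair[0] for pair in knn
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance]
```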
X3. Obtain the pose information corresponding to the current video frame image based on the pose information of the matched three-dimensional points.
In some embodiments, after determining the matched three-dimensional points, the terminal device 100 may perform pose solving on the pose information of the matched three-dimensional points to obtain the pose information corresponding to the current video frame image. The pose information corresponding to the current video frame image is the pose of the terminal device 100 relative to the real object (the world coordinate system) when the current video frame image is captured, and the pose information includes the position and orientation of the terminal device 100 in the world coordinate system.
In some embodiments, in order to improve the accuracy of the obtained pose information corresponding to the current video frame image, the terminal device 100 may also perform, based on a WebAssembly component, pose solving on the pose information of the matched three-dimensional points to obtain the pose information corresponding to the current video frame image.
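Exemplarily, the pose solving of X3 may be illustrated with a perspective-n-point (PnP) solution over the matched three-dimensional points and their two-dimensional image locations. PnP with RANSAC, the OpenCV calls and the assumption of known camera intrinsics are choices made for this illustration only; the disclosure refers simply to pose solving, which in practice runs in a WebAssembly component.

```python
import cv2
import numpy as np

def solve_frame_pose(object_points, image_points, camera_matrix):
    """Recover the camera pose of the current frame from matched 3D map points
    and their 2D image locations (at least 4 correspondences are required).
    `camera_matrix` holds the assumed-known intrinsics."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(object_points, dtype=np.float32),
        np.asarray(image_points, dtype=np.float32),
        camera_matrix, None)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)   # rotation matrix of the solved camera pose
    return R, tvec, inliers
```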
In the embodiments of the present disclosure, the terminal device 100 completes its own positioning according to the target local point cloud map obtained from the server 200 and the current video frame image. It can be understood that the full point cloud map is larger than a local point cloud map and requires more computing power, and is therefore not suitable for the terminal device 100 with limited computing power. A local point cloud map is smaller than the full point cloud map and requires less computing power, so the terminal device 100 can complete its own positioning according to the target local point cloud map and the current video frame image, thereby achieving relatively high-precision positioning of the terminal device 100 with the limited computing power of the terminal device 100. Moreover, since the terminal device 100 positions itself using the local point cloud map and the current video frame image, it is less affected by indoor buildings, so that AR navigation based on the terminal device 100 is applicable to indoor AR navigation, which expands the scope of application of AR navigation based on the terminal device 100.
In some embodiments, in the positioning method provided by the embodiments of the present disclosure, after the pose information corresponding to the current video frame image is determined, a navigation trajectory may also be generated according to the pose information of video frame images at multiple moments. Exemplarily, the generation process of the navigation trajectory may include A1 to A3.
A1. The terminal device 100 obtains a pose sequence.
The pose sequence includes the pose information corresponding to video frame images at multiple moments.
In some embodiments, while the AR navigation function of the terminal device 100 is enabled, the terminal device 100 may capture, based on WebRTC technology, the video stream of the user's surrounding environment photographed by the terminal device 100 over a period of time to obtain video frame images of the user's surrounding environment at multiple moments, and then perform the processing of S105 on the video frame images at the multiple moments to obtain the pose information of the video frame images at the multiple moments, that is, to obtain the pose sequence.
A2. The terminal device 100 performs interpolation processing and filtering processing on the pose sequence to obtain an interpolated and filtered pose sequence.
It can be understood that while the terminal device 100 photographs the user's surrounding environment, that is, during image acquisition, frames may be missed or acquired incorrectly, so that the obtained pose sequence is not smooth enough. In order to reduce the impact of missed or incorrect acquisition on the smoothness of the pose sequence, interpolation processing and filtering processing may be performed on the pose sequence, so that the interpolated and filtered pose sequence has good smoothness.
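Exemplarily, the interpolation processing and filtering processing of A2 may be sketched as follows for the position part of the pose sequence. Uniform-timeline resampling and a moving-average filter are assumptions of this illustration; the disclosure does not fix the interpolation or filtering method.

```python
import numpy as np

def smooth_pose_positions(timestamps, positions, window=5):
    """Resample the position samples of the pose sequence onto a uniform
    timeline, then smooth them with a moving-average filter."""
    t = np.asarray(timestamps, dtype=float)
    p = np.asarray(positions, dtype=float)            # shape (n, 3)
    t_uniform = np.linspace(t[0], t[-1], len(t))
    p_interp = np.column_stack(
        [np.interp(t_uniform, t, p[:, i]) for i in range(p.shape[1])])
    kernel = np.ones(window) / window
    p_smooth = np.column_stack(
        [np.convolve(p_interp[:, i], kernel, mode="same") for i in range(p.shape[1])])
    return t_uniform, p_smooth
```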
A3、终端设备100基于经过插值处理和滤波处理后的位姿序列,显示导航轨迹。A3. The terminal device 100 displays the navigation trajectory based on the posture sequence after interpolation and filtering.
在一些实施例中,在终端设备100得到经过插值处理和滤波处理后的位姿序列后,终端设备100基于经过插值处理和滤波处理后的位姿序列和目标局部点云地图,生成导航轨迹,并显示导航轨迹。In some embodiments, after the terminal device 100 obtains the pose sequence after interpolation and filtering, the terminal device 100 generates a navigation trajectory based on the pose sequence after interpolation and filtering and the target local point cloud map, and displays the navigation trajectory.
在一些实施例中,终端设备100基于经过插值处理和滤波处理后的位姿序列,显示导航轨迹,可以实现为:终端设备100基于经过插值处理和滤波处理后的位姿序列,通过Web浏览器客户端显示导航轨迹。In some embodiments, the terminal device 100 displays the navigation trajectory based on the posture sequence after interpolation and filtering, which can be implemented as follows: the terminal device 100 displays the navigation trajectory through a web browser client based on the posture sequence after interpolation and filtering.
In some embodiments, while displaying the navigation trajectory, the terminal device 100 updates its current position in real time based on the current video frame image in the captured video stream of the user's surroundings, and updates the navigation trajectory in real time in combination with the local point cloud map and the video frame image preceding the current video frame image, thereby achieving a tracking effect.
In the embodiments of the present disclosure, after obtaining the pose sequence, the terminal device 100 interpolates and filters it so that the navigation trajectory derived from the processed sequence is smoother. While displaying the navigation trajectory, the terminal device 100 updates it in real time according to the previous video frame image and the local point cloud map, thereby tracking the user and helping to improve the user experience.
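The disclosure does not prescribe a particular interpolation or filtering algorithm for step A2. The Python sketch below is one minimal possibility, assuming linear interpolation onto a uniform time grid followed by a moving-average filter applied to the translation part of each pose; the function name, sampling rate, and window size are illustrative only.

```python
import numpy as np

def smooth_pose_sequence(timestamps, positions, target_hz=30.0, window=5):
    """Resample the translation part of a pose sequence onto a uniform time
    grid (interpolation), then smooth it with a moving average (filtering)."""
    t = np.asarray(timestamps, dtype=float)   # shape (N,)
    p = np.asarray(positions, dtype=float)    # shape (N, 3)

    # Interpolation: fill gaps caused by missed or erroneous frame acquisition.
    t_uniform = np.arange(t[0], t[-1], 1.0 / target_hz)
    interp = np.stack(
        [np.interp(t_uniform, t, p[:, k]) for k in range(3)], axis=1)

    # Filtering: a moving average suppresses jitter in the per-frame estimates.
    kernel = np.ones(window) / window
    smooth = np.stack(
        [np.convolve(interp[:, k], kernel, mode="same") for k in range(3)], axis=1)
    return t_uniform, smooth
```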
The foregoing describes the solutions provided by the embodiments of the present disclosure mainly from the perspective of the methods. To implement the above functions, the positioning device includes corresponding hardware structures and/or software modules for performing each function. Those skilled in the art will readily appreciate that, in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein, the present disclosure can be implemented by hardware or by a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and design constraints of the technical solution. Skilled artisans may implement the described functions differently for each particular application, but such implementations should not be considered beyond the scope of the present disclosure.
FIG. 4 is a schematic diagram of the composition of a positioning device according to some embodiments. As shown in FIG. 4, the present disclosure provides a positioning device 2000. The positioning device 2000 may be the terminal device 100, a functional module in the terminal device 100, or any electronic device connected to the terminal device 100, which is not limited in the present disclosure. The positioning device 2000 includes a communication unit 2001 and a processing unit 2002. In some embodiments, the positioning device 2000 may further include a storage unit 2003.
In some embodiments, the communication unit 2001 is used to obtain a target local point cloud map from the server 200, where the target local point cloud map is a local point cloud map that is closest to an initial position of the terminal device 100 among multiple local point cloud maps.
The processing unit 2002 is used to determine, after the terminal device 100 moves into the coverage of the target local point cloud map, the pose information corresponding to the current video frame image based on the current video frame image and the target local point cloud map.
In some embodiments, each local point cloud map is obtained by performing optimization processing on an initial local point cloud map; the optimization processing includes at least one of the following: removing outlier three-dimensional points from the initial local point cloud map; or converting a one-to-many relationship between three-dimensional points and feature descriptors in the initial local point cloud map into a one-to-one relationship.
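As a rough illustration of these two optimization operations (neither of which is tied to a specific algorithm by the disclosure), the sketch below removes outlier 3D points with a statistical nearest-neighbour distance test and collapses the one-to-many point-to-descriptor relationship by averaging each point's observed descriptors; both concrete choices, and all names, are assumptions made for the example.

```python
import numpy as np
from scipy.spatial import cKDTree

def optimize_local_map(points, descriptors_per_point, k=8, std_ratio=2.0):
    """points: (N, 3) array of 3D points; descriptors_per_point: list of
    (M_i, D) arrays, one entry per point (the one-to-many relationship)."""
    pts = np.asarray(points, dtype=float)

    # Outlier removal: a point whose mean distance to its k nearest
    # neighbours is far above the global average is treated as an outlier.
    dists, _ = cKDTree(pts).query(pts, k=k + 1)   # column 0 is the point itself
    mean_d = dists[:, 1:].mean(axis=1)
    keep = mean_d < mean_d.mean() + std_ratio * mean_d.std()

    # One-to-one conversion: keep a single merged descriptor per surviving point.
    merged = np.asarray([np.mean(descriptors_per_point[i], axis=0)
                         for i in np.flatnonzero(keep)])
    return pts[keep], merged
```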
In some embodiments, the initial local point cloud map is obtained by segmenting a point cloud map, and the point cloud map is constructed in the following manner: processing multiple frames of environment images with a three-dimensional reconstruction method based on a first feature extraction manner to determine pose information and a first feature descriptor of each of multiple three-dimensional points; processing the multiple frames of environment images in a second feature extraction manner to determine a second feature descriptor of each of the multiple three-dimensional points, where the data volume of the second feature descriptor of a three-dimensional point is smaller than the data volume of the first feature descriptor of the same three-dimensional point; and constructing the point cloud map based on the pose information and the second feature descriptor of each of the multiple three-dimensional points.
In some embodiments, the communication unit is further used to send first request information to the server 200, where the first request information is used to request the target local point cloud map and includes a video frame image used to locate the initial position.
The communication unit is used to receive first response information from the server 200, where the first response information includes the target local point cloud map.
In some embodiments, the storage unit 2003 is used to store a local point cloud map.
In some embodiments, the communication unit 2001 is used to receive the target local point cloud map in a binary file format sent by the server 200.
In some embodiments, the processing unit is used to: perform feature extraction on the current video frame image to obtain a feature descriptor of the current video frame image; perform feature matching between the feature descriptor of the current video frame image and the feature descriptors of the three-dimensional points in the local point cloud map to determine matched three-dimensional points; and obtain the pose information corresponding to the current video frame image based on the pose information of the matched three-dimensional points.
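A minimal sketch of this matching-and-pose step is given below. It assumes the map stores ORB descriptors, uses a brute-force Hamming matcher, and recovers the pose with PnP plus RANSAC from the resulting 2D-3D correspondences; the function name, the camera intrinsic matrix K, and the data layout are all assumptions for illustration.

```python
import cv2
import numpy as np

def localize_frame(frame_gray, map_points_3d, map_descriptors, K):
    """frame_gray: current video frame (grayscale); map_points_3d: (M, 3) array;
    map_descriptors: (M, 32) uint8 ORB descriptors, one per 3D point."""
    orb = cv2.ORB_create()
    keypoints, frame_desc = orb.detectAndCompute(frame_gray, None)

    # Match the frame's descriptors against the per-point descriptors of the map.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(frame_desc, map_descriptors)

    pts_2d = np.float32([keypoints[m.queryIdx].pt for m in matches])
    pts_3d = np.float32([map_points_3d[m.trainIdx] for m in matches])

    # Recover the camera pose from the 2D-3D correspondences.
    ok, rvec, tvec, _ = cv2.solvePnPRansac(pts_3d, pts_2d, K, None)
    return (rvec, tvec) if ok else None
```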
In some embodiments, the communication unit 2001 is further used to obtain a pose sequence, where the pose sequence includes pose information corresponding to video frame images at multiple moments.
The processing unit 2002 is further used to: perform interpolation and filtering on the pose sequence to obtain an interpolated and filtered pose sequence; and display a navigation trajectory based on the interpolated and filtered pose sequence.
In some embodiments, the processing unit 2002 is used to display the navigation trajectory through a World Wide Web (Web) browser client based on the interpolated and filtered pose sequence.
FIG. 5 is a schematic diagram of the composition of another positioning device according to some embodiments. As shown in FIG. 5, the present disclosure provides a positioning device 3000. The positioning device 3000 may be the server 200, a functional module in the server 200, or any electronic device connected to the server 200, which is not limited in the present disclosure. The positioning device 3000 includes a communication unit 3001 and a processing unit 3002. In some embodiments, the positioning device 3000 may further include a storage unit 3003.
In some embodiments, the processing unit 3002 is used to determine the initial position of the terminal device 100.
The processing unit 3002 is further used to determine, based on the initial position of the terminal device 100, a target local point cloud map that is closest to the initial position of the terminal device 100 from multiple local point cloud maps.
The communication unit 3001 is used to send the target local point cloud map to the terminal device 100.
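A bare-bones sketch of the selection performed by the processing unit 3002, assuming each stored local point cloud map carries a reference point such as the centroid of its 3D points; the dictionary layout and field names are illustrative only.

```python
import numpy as np

def select_target_map(initial_position, local_maps):
    """local_maps: list of dicts such as {"id": ..., "center": (x, y, z), ...}."""
    centers = np.asarray([m["center"] for m in local_maps], dtype=float)
    dists = np.linalg.norm(centers - np.asarray(initial_position, dtype=float), axis=1)
    return local_maps[int(np.argmin(dists))]   # the map closest to the initial position
```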
In some embodiments, the processing unit 3002 is further used to: segment a point cloud map to obtain multiple initial local point cloud maps; and perform optimization processing on the multiple initial local point cloud maps respectively to obtain the multiple local point cloud maps; where the optimization processing includes at least one of the following: removing outlier three-dimensional points from the initial local point cloud map; or converting a one-to-many relationship between three-dimensional points and feature descriptors in the initial local point cloud map into a one-to-one relationship.
In some embodiments, the processing unit 3002 is further used to: process multiple frames of environment images with a three-dimensional reconstruction method based on a first feature extraction manner to determine pose information and a first feature descriptor of each of multiple three-dimensional points; process the multiple frames of environment images in a second feature extraction manner to determine a second feature descriptor of each of the multiple three-dimensional points, where the data volume of the second feature descriptor of a three-dimensional point is smaller than the data volume of the first feature descriptor of the same three-dimensional point; and construct the point cloud map based on the pose information and the second feature descriptor of each of the multiple three-dimensional points.
In some embodiments, the first feature descriptor is a scale-invariant feature transform (SIFT) feature descriptor, and the second feature descriptor is an ORB feature descriptor.
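To make the descriptor-size contrast concrete: in OpenCV, for example, a SIFT descriptor occupies 128 float32 values per keypoint, whereas an ORB descriptor occupies 32 bytes per keypoint. The sketch below only shows the two extraction passes on a single environment image; the full reconstruction pipeline that produces the 3D points and their poses is assumed and not shown.

```python
import cv2

def extract_dual_descriptors(image_path):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    sift_kp, sift_desc = cv2.SIFT_create().detectAndCompute(img, None)  # used for 3D reconstruction
    orb_kp, orb_desc = cv2.ORB_create().detectAndCompute(img, None)     # stored in the point cloud map

    # sift_desc: float32, 128 values (512 bytes) per keypoint;
    # orb_desc:  uint8, 32 bytes per keypoint.
    return (sift_kp, sift_desc), (orb_kp, orb_desc)
```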
In some embodiments, the communication unit 3001 is used to receive first request information from the terminal device 100, where the first request information is used to request a local point cloud map and includes a video frame image used to locate the initial position of the terminal device 100.
The processing unit 3002 is used to determine the initial position of the terminal device 100 according to the video frame image used to locate the initial position of the terminal device 100.
In some embodiments, the communication unit 3001 is used to send the target local point cloud map in a binary file format to the terminal device 100.
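The disclosure calls for a binary file but does not dictate its internal layout. Purely for illustration, the sketch below packs a local map's 3D points and descriptors into a compressed NumPy archive held in memory on the server side and unpacks it again on the terminal side; the container format is an assumption.

```python
import io
import numpy as np

def map_to_binary(points_3d, descriptors):
    buf = io.BytesIO()
    np.savez_compressed(buf, points=points_3d, descriptors=descriptors)
    return buf.getvalue()                      # bytes payload sent to the terminal device

def binary_to_map(payload):
    data = np.load(io.BytesIO(payload))
    return data["points"], data["descriptors"]
```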
In some embodiments, the above method is applied to a server running a Web browser server side, where a communication connection exists between the Web browser server side and a Web browser client running on the terminal device 100.
In some embodiments, the storage unit 3003 is used to store the point cloud map.
The units in FIG. 4 and FIG. 5 may also be referred to as modules; for example, a processing unit may be referred to as a processing module.
If the units in FIG. 4 and FIG. 5 are implemented in the form of software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the embodiments of the present disclosure essentially, or the part contributing to the related art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or some of the steps of the methods described in the embodiments of the present disclosure. The storage medium storing the computer software product includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
FIG. 6 is a schematic structural diagram of a terminal device according to some embodiments. As shown in FIG. 6, when the functions of the above integrated modules are implemented in the form of hardware, the present disclosure provides a terminal device 4000, which may be the positioning device 2000 described above. The terminal device 4000 includes a processor 4002, a communication interface 4003, and a bus 4004. In some embodiments, the terminal device 4000 may further include a memory 4001.
FIG. 7 is a schematic structural diagram of a server according to some embodiments. As shown in FIG. 7, when the functions of the above integrated modules are implemented in the form of hardware, the present disclosure provides a server 5000, which may be the positioning device 3000 described above. The server 5000 includes a processor 5002, a communication interface 5003, and a bus 5004. In some embodiments, the server 5000 may further include a memory 5001.
It should be noted that the processor 4002 and the processor 5002 may each implement or execute the various illustrative logical blocks, modules, and circuits described in connection with the content of the present disclosure. The processor 4002 and the processor 5002 may each be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The processor 5002 may also be a combination implementing a computing function, for example, a combination including one or more microprocessors or a combination of a DSP and a microprocessor.
The communication interface 4003 and the communication interface 5003 are each used to connect to other devices through a communication network. The communication network may be an Ethernet, a radio access network, a wireless local area network (WLAN), or the like.
The memory 4001 and the memory 5001 may each be a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions, an electrically erasable programmable read-only memory (EEPROM), a magnetic disk storage medium or another magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
In some embodiments, the memory 4001 may exist independently of the processor 4002 and may be connected to the processor 4002 through the bus 4004 to store instructions or program code. When the processor 4002 invokes and executes the instructions or program code stored in the memory 4001, the positioning method provided in the embodiments of the present disclosure can be implemented. Likewise, the memory 5001 may exist independently of the processor 5002 and may be connected to the processor 5002 through the bus 5004 to store instructions or program code. When the processor 5002 invokes and executes the instructions or program code stored in the memory 5001, the positioning method provided in the embodiments of the present disclosure can be implemented.
In other embodiments, the memory 4001 may be integrated with the processor 4002, and the memory 5001 may be integrated with the processor 5002.
The bus 4004 and the bus 5004 may each be an extended industry standard architecture (EISA) bus or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one bold line is used in each of FIG. 6 and FIG. 7, but this does not mean that there is only one bus or only one type of bus.
From the description of the above implementations, those skilled in the art can clearly understand that, for convenience and brevity of description, only the division into the above functional modules is used as an example. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the internal structure of the positioning device may be divided into different functional modules to complete all or some of the functions described above.
The embodiments of the present disclosure further provide a computer-readable storage medium. All or some of the procedures in the above method embodiments may be completed by computer instructions instructing relevant hardware, and the program may be stored in the computer-readable storage medium; when the program is executed, the procedures of the above method embodiments may be included. The computer-readable storage medium may be the memory described in any of the foregoing embodiments. The computer-readable storage medium may also be an external storage device of the positioning device, for example, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the positioning device. Further, the computer-readable storage medium may include both the internal storage unit of the positioning device and an external storage device. The computer-readable storage medium is used to store the computer program and other programs and data required by the positioning device, and may also be used to temporarily store data that has been output or is to be output.
The embodiments of the present disclosure further provide a computer program product containing a computer program. When the computer program product runs on a computer, the computer is caused to perform any one of the positioning methods provided in the above embodiments.
Although the present disclosure is described herein in conjunction with the embodiments, in the process of implementing the claimed disclosure, those skilled in the art can understand and achieve other variations of the disclosed embodiments by studying the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other components or steps, and "a" or "an" does not exclude a plurality. A single processor or another unit may fulfil several functions recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Although the present disclosure has been described in conjunction with specific features and embodiments thereof, it is apparent that various modifications and combinations may be made without departing from the spirit and scope of the present disclosure. Accordingly, this specification and the drawings are merely exemplary illustrations of the present disclosure as defined by the appended claims, and are deemed to cover any and all modifications, variations, combinations, or equivalents within the scope of the present disclosure. Obviously, those skilled in the art may make various changes and variations to the present disclosure without departing from its spirit and scope. Thus, if these modifications and variations of the present disclosure fall within the scope of the claims of the present disclosure and their equivalent technologies, the present disclosure is also intended to encompass them.
The above are only specific implementations of the present application, but the protection scope of the present application is not limited thereto; any changes or substitutions within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (19)

1. A positioning method, comprising:
    acquiring a target local point cloud map from a server, wherein the target local point cloud map is a local point cloud map that is closest to an initial position of a terminal device among a plurality of local point cloud maps; and
    after moving into coverage of the target local point cloud map, determining, based on a current video frame image and the target local point cloud map, pose information corresponding to the current video frame image.
2. The method according to claim 1, wherein each of the local point cloud maps is obtained by performing optimization processing on an initial local point cloud map, and the optimization processing comprises one or more of the following:
    removing outlier three-dimensional points from the initial local point cloud map; or
    converting a one-to-many relationship between the three-dimensional points and feature descriptors in the initial local point cloud map into a one-to-one relationship.
3. The method according to claim 2, wherein the initial local point cloud map is obtained by segmenting a point cloud map, and the point cloud map is constructed by:
    processing a plurality of frames of environment images with a three-dimensional reconstruction method based on a first feature extraction manner to determine pose information and a first feature descriptor of each of a plurality of three-dimensional points;
    processing the plurality of frames of environment images in a second feature extraction manner to determine a second feature descriptor of each of the plurality of three-dimensional points, wherein a data volume of the second feature descriptor of a three-dimensional point is smaller than a data volume of the first feature descriptor of the same three-dimensional point; and
    constructing the point cloud map based on the pose information and the second feature descriptor of each of the plurality of three-dimensional points.
4. The method according to claim 3, wherein the first feature descriptor is a scale-invariant feature transform (SIFT) feature descriptor, and the second feature descriptor is an ORB feature descriptor.
5. The method according to any one of claims 1 to 4, further comprising:
    sending first request information to the server, wherein the first request information is used to request the target local point cloud map and comprises a video frame image used to locate the initial position;
    wherein acquiring the target local point cloud map from the server comprises:
    receiving first response information from the server, wherein the first response information comprises the target local point cloud map.
6. The method according to any one of claims 1 to 4, wherein acquiring the target local point cloud map from the server comprises:
    receiving the target local point cloud map in a binary file format sent by the server.
7. The method according to any one of claims 1 to 4, wherein determining, based on the current video frame image and the target local point cloud map, the pose information corresponding to the current video frame image comprises:
    performing feature extraction on the current video frame image to obtain a feature descriptor of the current video frame image;
    performing feature matching between the feature descriptor of the current video frame image and feature descriptors of the three-dimensional points in the target local point cloud map to determine matched three-dimensional points; and
    obtaining, based on pose information of the matched three-dimensional points, the pose information corresponding to the current video frame image.
8. The method according to any one of claims 1 to 4, further comprising:
    acquiring a pose sequence, wherein the pose sequence comprises pose information corresponding to video frame images at a plurality of moments;
    performing interpolation and filtering on the pose sequence to obtain an interpolated and filtered pose sequence; and
    displaying a navigation trajectory based on the interpolated and filtered pose sequence.
9. The method according to claim 8, wherein displaying the navigation trajectory based on the interpolated and filtered pose sequence comprises:
    displaying the navigation trajectory through a World Wide Web (Web) browser client based on the interpolated and filtered pose sequence.
10. A positioning method, comprising:
    determining an initial position of a terminal device;
    determining, according to the initial position of the terminal device, a target local point cloud map that is closest to the initial position of the terminal device from a plurality of local point cloud maps; and
    sending the target local point cloud map to the terminal device.
11. The method according to claim 10, further comprising:
    segmenting a point cloud map to obtain a plurality of initial local point cloud maps; and
    performing optimization processing on the plurality of initial local point cloud maps respectively to obtain the plurality of local point cloud maps;
    wherein the optimization processing comprises one or more of the following:
    removing outlier three-dimensional points from the initial local point cloud map; or
    converting a one-to-many relationship between the three-dimensional points and feature descriptors in the initial local point cloud map into a one-to-one relationship.
12. The method according to claim 11, further comprising:
    processing a plurality of frames of environment images with a three-dimensional reconstruction method based on a first feature extraction manner to determine pose information and a first feature descriptor of each of a plurality of three-dimensional points;
    processing the plurality of frames of environment images in a second feature extraction manner to determine a second feature descriptor of each of the plurality of three-dimensional points, wherein a data volume of the second feature descriptor of a three-dimensional point is smaller than a data volume of the first feature descriptor of the same three-dimensional point; and
    constructing the point cloud map based on the pose information and the second feature descriptor of each of the plurality of three-dimensional points.
13. The method according to claim 12, wherein the first feature descriptor is a scale-invariant feature transform (SIFT) feature descriptor, and the second feature descriptor is an ORB feature descriptor.
14. The method according to any one of claims 10 to 13, wherein determining the initial position of the terminal device comprises:
    receiving first request information from the terminal device, wherein the first request information is used to request a local point cloud map and comprises a video frame image used to locate the initial position of the terminal device; and
    determining the initial position of the terminal device according to the video frame image used to locate the initial position of the terminal device.
15. The method according to any one of claims 10 to 13, wherein sending the target local point cloud map to the terminal device comprises:
    sending the target local point cloud map in a binary file format to the terminal device.
16. The method according to any one of claims 10 to 13, wherein the method is applied to a server running a Web browser server side, and a communication connection exists between the Web browser server side and a Web browser client running on the terminal device.
17. A terminal device, comprising a processor and a memory;
    wherein the memory stores instructions executable by the processor; and
    the processor is configured to execute the instructions to cause the terminal device to implement the method according to any one of claims 1 to 9.
18. A server, comprising a processor and a memory;
    wherein the memory stores instructions executable by the processor; and
    the processor is configured to execute the instructions to cause the server to implement the method according to any one of claims 10 to 16.
19. A computer-readable storage medium, comprising computer instructions, wherein when the computer instructions are run on a computer, the computer is caused to perform the method according to any one of claims 1 to 9, or the method according to any one of claims 10 to 16.
PCT/CN2023/100387 2022-11-15 2023-06-15 Positioning method, terminal device, server, and storage medium WO2024103708A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211429870.1 2022-11-15
CN202211429870.1A CN118052867A (en) 2022-11-15 2022-11-15 Positioning method, terminal equipment, server and storage medium

Publications (1)

Publication Number Publication Date
WO2024103708A1

Family

ID=91050815

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/100387 WO2024103708A1 (en) 2022-11-15 2023-06-15 Positioning method, terminal device, server, and storage medium

Country Status (2)

Country Link
CN (1) CN118052867A (en)
WO (1) WO2024103708A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110561423A (en) * 2019-08-16 2019-12-13 深圳优地科技有限公司 pose transformation method, robot and storage medium
CN112445929A (en) * 2019-08-30 2021-03-05 浙江商汤科技开发有限公司 Visual positioning method and related device
CN113470089A (en) * 2021-07-21 2021-10-01 中国人民解放军国防科技大学 Cross-domain cooperative positioning and mapping method and system based on three-dimensional point cloud
WO2021253430A1 (en) * 2020-06-19 2021-12-23 深圳市大疆创新科技有限公司 Absolute pose determination method, electronic device and mobile platform
WO2022002039A1 (en) * 2020-06-30 2022-01-06 杭州海康机器人技术有限公司 Visual positioning method and device based on visual map
CN115290097A (en) * 2022-09-30 2022-11-04 安徽建筑大学 BIM-based real-time accurate map construction method, terminal and storage medium

Also Published As

Publication number Publication date
CN118052867A (en) 2024-05-17
