CN114187344A - Map construction method, device and equipment - Google Patents

Map construction method, device and equipment

Info

Publication number
CN114187344A
Authority
CN
China
Prior art keywords
image
determining
target
dimensional
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111348552.8A
Other languages
Chinese (zh)
Inventor
秦延文
李佳宁
毛慧
浦世亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN202111348552.8A
Publication of CN114187344A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/50: Depth or shape recovery
    • G06T7/55: Depth or shape recovery from multiple images
    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01C: MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00: Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/38: Electronic maps specially adapted for navigation; Updating thereof
    • G01C21/3804: Creation or updating of map data
    • G01C21/3807: Creation or updating of map data characterised by the type of data
    • G01C21/383: Indoor data
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/70: Determining position or orientation of objects or cameras
    • G06T7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/10: Image acquisition modality
    • G06T2207/10028: Range image; Depth image; 3D point clouds

Abstract

The application provides a map construction method, a map construction device and map construction equipment, wherein the method comprises the following steps: acquiring a panoramic image of a target scene, and generating a first pinhole image corresponding to a first virtual camera based on the panoramic image; determining a rotation matrix between a target pose of a second virtual camera and an initial pose, and determining an external parameter matrix between the target pose and the initial pose based on the rotation matrix; determining a second pinhole image corresponding to the second virtual camera based on the rotation matrix, selecting two-dimensional feature points corresponding to an actual position of the target scene from the first pinhole image and the second pinhole image, and determining a three-dimensional map point corresponding to the actual position based on the two-dimensional feature points and the external parameter matrix; and constructing a three-dimensional visual map of the target scene based on a plurality of three-dimensional map points of the target scene. With this technical solution, a terminal device in the target scene can be globally positioned based on the three-dimensional visual map, so that the terminal device is positioned accurately.

Description

Map construction method, device and equipment
Technical Field
The present application relates to the field of computer vision, and in particular, to a map construction method, apparatus, and device.
Background
The GPS (Global Positioning System) is a high-precision radio navigation and positioning system based on artificial earth satellites; it can provide accurate geographic position, speed and time information anywhere in the world and in near-earth space. The Beidou satellite navigation system consists of a space segment, a ground segment and a user segment; it can provide high-precision, high-reliability positioning, navigation and timing services to users worldwide around the clock, and also has regional navigation, positioning and timing capabilities.
Since terminal devices are equipped with GPS or the Beidou satellite navigation system, GPS or Beidou can be used to position a terminal device whenever positioning is needed. In an outdoor environment, GPS or Beidou signals are strong, so the terminal device can be positioned accurately with GPS or the Beidou satellite navigation system. In an indoor environment, however, GPS or Beidou signals are poor, so GPS or the Beidou satellite navigation system cannot position the terminal device accurately. For example, in energy industries such as coal, electric power and petrochemicals, positioning needs are growing, and these needs usually arise in indoor environments; because of signal occlusion and similar problems, the terminal device cannot be positioned accurately.
Disclosure of Invention
The application provides a map construction method, which comprises the following steps:
acquiring a panoramic image of a target scene, and generating a first pinhole image corresponding to a first virtual camera based on the panoramic image; the position of the first virtual camera is the sphere center position of a visual spherical coordinate system, and the initial posture of the first virtual camera is any posture taking the sphere center position as the center;
determining a rotation matrix between a target pose of a second virtual camera and the initial pose, and determining an external parameter matrix between the target pose and the initial pose based on the rotation matrix; the position of the second virtual camera is the sphere center position of the visual spherical coordinate system, and the target posture is obtained by rotating the initial posture around the coordinate axis of the visual spherical coordinate system;
determining a second pinhole image corresponding to the second virtual camera based on the rotation matrix, selecting two-dimensional feature points corresponding to the actual position of the target scene from the first pinhole image and the second pinhole image, and determining three-dimensional map points corresponding to the actual position based on the two-dimensional feature points and the external parameter matrix;
a three-dimensional visual map of a target scene is constructed based on a plurality of three-dimensional map points of the target scene.
The present application provides a map construction apparatus, the apparatus including:
the acquisition module is used for acquiring a panoramic image of a target scene;
the generating module is used for generating a first pinhole image corresponding to the first virtual camera based on the panoramic image; the position of the first virtual camera is the sphere center position of a visual spherical coordinate system, and the initial posture of the first virtual camera is any posture taking the sphere center position as the center;
a determining module, configured to determine a rotation matrix between a target pose of a second virtual camera and the initial pose, and to determine an external parameter matrix between the target pose and the initial pose based on the rotation matrix; the position of the second virtual camera is the sphere center position of the visual spherical coordinate system, and the target posture is obtained by rotating the initial posture around the coordinate axis of the visual spherical coordinate system;
the generating module is further configured to determine a second pinhole image corresponding to the second virtual camera based on the rotation matrix; the determining module is further configured to select a two-dimensional feature point corresponding to an actual position of the target scene from the first pinhole image and the second pinhole image, and determine a three-dimensional map point corresponding to the actual position based on the two-dimensional feature point and the external parameter matrix; and constructing a three-dimensional visual map of the target scene based on the plurality of three-dimensional map points of the target scene.
The present application provides a map building apparatus, including: a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor; the processor is used for executing machine executable instructions to realize the map construction method of the embodiment of the application.
According to the technical solution, a three-dimensional visual map of the target scene can be constructed, and a terminal device in the target scene can be globally positioned based on the three-dimensional visual map, so that the terminal device is positioned accurately. The target scene may be an indoor environment, so a vision-based indoor positioning function is realized. The method can be applied to energy industries such as coal, electric power and petrochemicals, realizing indoor positioning of personnel (such as workers and inspection personnel), acquiring personnel position information quickly, safeguarding personnel safety, and enabling efficient management of personnel. The three-dimensional visual map can be constructed from panoramic images of the target scene; because a panoramic image has a large field of view, repeated data acquisition of the target scene is avoided and data acquisition efficiency is improved. When the three-dimensional map points are determined, pinhole images are obtained by projection through the virtual cameras, pose constraints between the virtual cameras are added, and the three-dimensional map points are determined with a virtual camera binding optimization strategy, which improves the robustness, efficiency and accuracy of map construction.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating a mapping method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a panoramic image based scene reconstruction and localization approach of the present application;
FIG. 3 is a schematic diagram of unfolding a panoramic image into pinhole images in one embodiment of the present application;
FIG. 4A is a schematic illustration of latitude and longitude coordinates of a spherical image and rectangular coordinates of a panoramic image;
FIG. 4B is a schematic diagram of rectangular coordinates of the first pinhole image and longitude and latitude coordinates of the spherical-view image;
FIG. 4C is a schematic diagram between the coordinates of the pinhole image and the coordinates of the viewing sphere;
fig. 5 is a schematic structural diagram of a map building apparatus according to an embodiment of the present application.
Detailed Description
The terminology used in the embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein is meant to encompass any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the embodiments of the present application to describe various information, the information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. Depending on the context, moreover, the word "if" as used may be interpreted as "when" or "upon" or "in response to a determination".
The embodiment of the application provides a map construction method, which is used for constructing a three-dimensional visual map of a target scene, and then using the three-dimensional visual map to perform global positioning on terminal equipment of the target scene, namely, using the three-dimensional visual map to perform global positioning on the terminal equipment in the moving process of the target scene. Referring to fig. 1, a schematic flow chart of a map construction method is shown, where the method may include:
step 101, acquiring a panoramic image of a target scene, and generating a first pinhole image corresponding to a first virtual camera based on the panoramic image. Illustratively, the position of the first virtual camera is a center position of a sphere from a spherical coordinate system, and the initial pose of the first virtual camera is any pose centered on the center position.
For example, generating a first pinhole image corresponding to the first virtual camera based on the panoramic image may include, but is not limited to: and generating a spherical viewing surface image corresponding to the spherical viewing surface coordinate system based on the panoramic image, and generating a first small hole image corresponding to the first virtual camera based on the spherical viewing surface image.
Generating the spherical view image corresponding to the spherical view coordinate system based on the panoramic image may include, but is not limited to: determining a mapping relationship between longitude and latitude coordinates in the spherical view image and rectangular coordinates in the panoramic image based on the width and the height of the panoramic image; for each longitude and latitude coordinate in the spherical view image, determining the rectangular coordinate corresponding to that longitude and latitude coordinate from the panoramic image based on the mapping relationship, and determining the pixel value of the longitude and latitude coordinate based on the pixel value of the rectangular coordinate. On this basis, the spherical view image can be generated from the pixel values of all longitude and latitude coordinates in the spherical view image.
Generating the first pinhole image corresponding to the first virtual camera based on the spherical view image may include, but is not limited to: determining the center point coordinate of the first pinhole image based on the width and the height of the first pinhole image, and determining a mapping relationship between rectangular coordinates in the first pinhole image and longitude and latitude coordinates in the spherical view image based on the center point coordinate and a target distance, where the target distance is the distance between the center point of the first pinhole image and the sphere center position of the spherical view coordinate system. For each rectangular coordinate in the first pinhole image, the longitude and latitude coordinate corresponding to the rectangular coordinate is determined from the spherical view image based on the mapping relationship, and the pixel value of the rectangular coordinate is determined based on the pixel value of the longitude and latitude coordinate. On this basis, the first pinhole image can be generated from the pixel values of all rectangular coordinates in the first pinhole image.
Step 102, determining a rotation matrix between a target pose of the second virtual camera and the initial pose of the first virtual camera, and determining an external parameter matrix between the target pose and the initial pose based on the rotation matrix. For example, the position of the second virtual camera may be the sphere center position of the spherical view coordinate system, and the target pose is obtained by rotating the initial pose around the coordinate axes of the spherical view coordinate system.
For example, determining a rotation matrix between the target pose of the second virtual camera and the initial pose of the first virtual camera may include, but is not limited to: determining a first rotation angle between the target posture and the initial posture in the first coordinate axis direction, and determining a first sub-rotation matrix in the first coordinate axis direction based on the first rotation angle; determining a second rotation angle between the target posture and the initial posture in the second coordinate axis direction, and determining a second sub-rotation matrix in the second coordinate axis direction based on the second rotation angle; and determining a third rotation angle between the target posture and the initial posture in the third coordinate axis direction, and determining a third sub-rotation matrix in the third coordinate axis direction based on the third rotation angle. On this basis, a rotation matrix between the target pose and the initial pose is determined based on the first sub-rotation matrix, the second sub-rotation matrix, and the third sub-rotation matrix.
In one possible embodiment, determining the external parameter matrix between the target pose and the initial pose based on the rotation matrix may include, but is not limited to: determining a translation matrix between the first virtual camera and the second virtual camera, and determining the external parameter matrix based on the rotation matrix and the translation matrix.
Step 103, determining a second pinhole image corresponding to the second virtual camera based on the rotation matrix, selecting two-dimensional feature points corresponding to an actual position of the target scene from the first pinhole image and the second pinhole image, and determining a three-dimensional map point corresponding to the actual position based on the two-dimensional feature points and the external parameter matrix.
In a possible implementation, determining the three-dimensional map point corresponding to the actual position based on the two-dimensional feature points and the external parameter matrix may include, but is not limited to: determining a target loss value of a configured loss function; determining a projection function value between the coordinate system of the virtual camera and the coordinate system of the pinhole image based on the target loss value and the two-dimensional feature points corresponding to the actual position; and determining the three-dimensional map point corresponding to the actual position based on the external parameter matrix and the projection function value, that is, taking it as a three-dimensional map point in the three-dimensional visual map.
Step 104, constructing a three-dimensional visual map of the target scene based on a plurality of three-dimensional map points of the target scene.
By way of example, the three-dimensional visual map may include, but is not limited to: sample global descriptors corresponding to sample images, three-dimensional map points corresponding to the sample images, and sample local descriptors corresponding to the three-dimensional map points; wherein a sample image is a pinhole image selected from the first pinhole image and the second pinhole image.
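For illustration only, the map information described above can be held in a simple container such as the one sketched below; the class and field names are assumptions for this sketch, not a data layout defined by this application.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class MapPoint:
    xyz: np.ndarray          # three-dimensional map point (x, y, z)
    local_desc: np.ndarray   # sample local descriptor of the image block around the point

@dataclass
class SampleImageEntry:
    global_desc: np.ndarray                                 # sample global descriptor of the sample image
    points: List[MapPoint] = field(default_factory=list)    # 3D map points observed in this sample image

@dataclass
class VisualMap:
    entries: List[SampleImageEntry] = field(default_factory=list)
```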
In a possible implementation manner, after step 104, in the global positioning process of the terminal device, a target image of the terminal device in a target scene is obtained; selecting candidate sample images from the multi-frame sample images based on the similarity between the target image and the multi-frame sample images corresponding to the three-dimensional visual map; acquiring a plurality of feature points from a target image; aiming at each feature point, determining a target three-dimensional map point corresponding to the feature point from the three-dimensional map points corresponding to the candidate sample image; and determining a global positioning pose in the three-dimensional visual map corresponding to the target image based on the plurality of feature points and the target three-dimensional map points corresponding to the plurality of feature points.
The candidate sample image is selected from the multiple frame sample images based on the similarity between the target image and the multiple frame sample images corresponding to the three-dimensional visual map, and the candidate sample image may include, but is not limited to: determining a global descriptor to be detected corresponding to the target image, and determining the distance between the global descriptor to be detected and a sample global descriptor corresponding to each frame of sample image corresponding to the three-dimensional visual map. Selecting a candidate sample image from the multi-frame sample images based on the distance between the global descriptor to be detected and each sample global descriptor; the distance between the global descriptor to be detected and the sample global descriptor corresponding to the candidate sample image is the minimum distance; or the distance between the global descriptor to be detected and the sample global descriptor corresponding to the candidate sample image is smaller than the distance threshold.
The determining of the target three-dimensional map point corresponding to the feature point from the three-dimensional map points corresponding to the candidate sample image may include, but is not limited to: determining a local descriptor to be detected corresponding to the feature point, wherein the local descriptor to be detected can be used for representing a feature vector of an image block where the feature point is located, and the image block can be located in the target image; determining the distance between the local descriptor to be tested and the sample local descriptor corresponding to each three-dimensional map point corresponding to the candidate sample image; on the basis, selecting target three-dimensional map points from the three-dimensional map points corresponding to the candidate sample images based on the distance between the local descriptor to be detected and each sample local descriptor; the distance between the local descriptor to be measured and the sample local descriptor corresponding to the target three-dimensional map point may be a minimum distance, and the minimum distance is smaller than a distance threshold.
The determining of the global descriptor to be detected corresponding to the target image may include, but is not limited to: determining a bag-of-words vector corresponding to the target image based on the trained dictionary model, and determining the bag-of-words vector as a global descriptor to be tested; or inputting the target image to the trained deep learning model to obtain a target vector corresponding to the target image, and determining the target vector as a global descriptor to be detected. Of course, the above are only two examples of determining the global descriptor to be tested, and the determination method of the global descriptor to be tested is not limited.
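As a rough sketch of the lookup steps above, the candidate sample image can be retrieved by a nearest-neighbour search over global descriptors, and 2D-3D correspondences can then be formed by matching local descriptors under a distance threshold. The container layout, the L2 metric and the threshold value are illustrative assumptions; the final pose solving step (for example a PnP solver) is only indicated in a comment.

```python
import numpy as np

def global_localize(query_global, query_features, sample_global_descs, sample_points, dist_threshold=0.7):
    """query_global: global descriptor of the target image.
    query_features: list of (pixel_xy, local_descriptor) from the target image.
    sample_global_descs: (N, D) array, one global descriptor per sample image.
    sample_points: per sample image, a list of (xyz, local_descriptor) 3D map points."""
    # 1) candidate sample image: smallest global-descriptor distance
    dists = np.linalg.norm(sample_global_descs - query_global, axis=1)
    cand = int(np.argmin(dists))

    # 2) 2D-3D correspondences: nearest local descriptor below the threshold
    matches = []
    for pixel_xy, desc in query_features:
        best_j, best_d = -1, np.inf
        for j, (xyz, local_desc) in enumerate(sample_points[cand]):
            d = np.linalg.norm(desc - local_desc)
            if d < best_d:
                best_j, best_d = j, d
        if best_j >= 0 and best_d < dist_threshold:
            matches.append((pixel_xy, sample_points[cand][best_j][0]))

    # 3) the global positioning pose in the three-dimensional visual map is then
    #    solved from these 2D-3D pairs, e.g. with a PnP solver (not shown here)
    return cand, matches
```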
According to the technical solution, a three-dimensional visual map of the target scene can be constructed, and a terminal device in the target scene can be globally positioned based on the three-dimensional visual map, so that the terminal device is positioned accurately. The target scene may be an indoor environment, so a vision-based indoor positioning function is realized. The method can be applied to energy industries such as coal, electric power and petrochemicals, realizing indoor positioning of personnel (such as workers and inspection personnel), acquiring personnel position information quickly, safeguarding personnel safety, and enabling efficient management of personnel. The three-dimensional visual map can be constructed from panoramic images of the target scene; because a panoramic image has a large field of view, repeated data acquisition of the target scene is avoided and data acquisition efficiency is improved. When the three-dimensional map points are determined, pinhole images are obtained by projection through the virtual cameras, pose constraints between the virtual cameras are added, and the three-dimensional map points are determined with a virtual camera binding optimization strategy, which improves the robustness, efficiency and accuracy of map construction.
The map construction method according to the embodiment of the present application will be described below with reference to specific embodiments.
The data sources for implementing a positioning function may include GPS, laser radar, millimeter wave radar, a vision sensor (such as a camera), and the like, and the positioning function can be implemented with these data sources. GPS is easily affected by satellite conditions, weather conditions and data transmission conditions, and cannot be used in indoor environments. Laser radar and millimeter wave radar have the advantages of a small amount of computation, availability of depth information, and insensitivity to illumination, but they provide only sparse information, are expensive, and are therefore not suitable for wide deployment. By comparison, the visual information provided by a vision sensor is affected by illumination and weather, but the vision sensor is low in cost, small in size, easy to install and rich in content, so it has great application prospects in map-based positioning.
In order to realize a positioning function by adopting a visual sensor, a high-precision three-dimensional visual map is generally required to be constructed, the three-dimensional visual map is used for sensing environment priori knowledge, global positioning can be carried out on the basis of the three-dimensional visual map, and a global positioning pose of the terminal equipment in the three-dimensional visual map is obtained, so that the positioning function is realized.
When a three-dimensional visual map is constructed, a monocular camera is usually used to acquire images of the target scene, and three-dimensional reconstruction can be performed on these images through an SFM (Structure From Motion) algorithm to obtain the three-dimensional visual map. However, the monocular camera has a narrow field of view, which leads to insufficient scene coverage; images of the target scene need to be acquired repeatedly for reconstruction, so the mapping efficiency is low.
To solve the above problems, an embodiment of the present application provides a scene reconstruction and positioning method based on panoramic images: a panoramic camera is used to collect panoramic images, and three-dimensional reconstruction is performed based on the panoramic images to obtain a three-dimensional visual map, where three-dimensional reconstruction refers to the process of reconstructing a physical-world scene into a three-dimensional point cloud through the SFM algorithm. Because the panoramic camera has a very large field of view, repeated acquisition of images of the target scene can be avoided and data acquisition efficiency is improved; the problems caused by the small field of view of the monocular camera are overcome, the quality of the three-dimensional visual map is significantly better than the mapping result of the monocular camera, the mapping efficiency is higher, and the mapping accuracy and robustness are improved.
Referring to fig. 2, a schematic diagram of the scene reconstruction and positioning method based on panoramic images is shown; the method may include an offline mapping process and an online positioning process. In the offline mapping process, a panoramic image of the target scene (i.e., the scene for which a three-dimensional visual map needs to be built) is collected, the panoramic image is expanded into pinhole images, multi-camera bound SFM reconstruction is performed based on the pinhole images and the fixed connection constraint (i.e., the external parameter matrix) to obtain the three-dimensional visual map, and the map information corresponding to the three-dimensional visual map is stored. In the online positioning process, a target image of the target scene is collected, pose calculation is performed on the target image based on the three-dimensional visual map of the target scene, and the global positioning pose in the three-dimensional visual map corresponding to the target image is obtained, which completes the positioning process.
Firstly, a panoramic image of a target scene can be collected, and the panoramic image is expanded into at least two pinhole images, as shown in fig. 3, which is a schematic diagram of expanding the panoramic image into the pinhole images, the process includes:
step 301, acquiring a panoramic image of a target scene. For example, a panoramic video of a target scene may be acquired, and the panoramic video may be converted into a multi-frame panoramic image, which is not limited to this process.
Step 302, generating a spherical view image corresponding to the spherical view coordinate system based on the panoramic image.
For example, assuming that the mapping relationship between the spherical view image and the panoramic image is a longitude-latitude mapping, the relationship between the longitude and latitude coordinates (λ, φ) of the spherical view image and the rectangular coordinates (x, y) of the panoramic image is shown in formula (1):

λ = (2π/W)·x + π, φ = (π/H)·y    Formula (1)

In formula (1), λ and φ respectively denote the longitude and latitude coordinates of the spherical view image (i.e., the image in the spherical view coordinate system), x and y respectively denote the horizontal and vertical coordinates of the panoramic image measured from the image center, and W and H denote the width and the height of the panoramic image.
Referring to fig. 4A, a schematic diagram of the relationship between the longitude and latitude coordinates of the spherical view image and the rectangular coordinates of the panoramic image is shown; the left side is a schematic diagram of the spherical view image, and the right side is a schematic diagram of the panoramic image. In the spherical view image, the longitude coordinate range is λ ∈ [0, 2π] and the latitude coordinate range is φ ∈ [-π/2, π/2]. In the panoramic image, the aspect ratio is W : H = 2 : 1, and the center of the panoramic image is the origin (0, 0).
Continuing with fig. 4A, the mapping relationship between the longitude and latitude coordinates (λ, φ) in the spherical view image and the rectangular coordinates (u_p, v_p) in the panoramic image can be seen in formula (2):

u_p = (W/2π)·(λ - π), v_p = (H/π)·φ    Formula (2)

As can be seen from formula (2), the mapping relationship between the longitude and latitude coordinates in the spherical view image and the rectangular coordinates in the panoramic image can be determined based on the width W and the height H of the panoramic image. Obviously, for each longitude and latitude coordinate (λ, φ) in the spherical view image, the corresponding rectangular coordinate (u_p, v_p) can be determined from the panoramic image based on the mapping relationship, and the pixel value of the longitude and latitude coordinate (λ, φ) can be determined based on the pixel value of the rectangular coordinate (u_p, v_p), i.e., the pixel value at (u_p, v_p) is taken as the pixel value at (λ, φ). On this basis, the pixel values of all longitude and latitude coordinates in the spherical view image are combined into the spherical view image, so that the spherical view image is obtained.
In summary, a spherical view image in the spherical view coordinate system can be obtained. The spherical view coordinate system may be a spherical coordinate system, which is not limited; the spherical view image is the image in the spherical view coordinate system, and the relationship between the spherical view image and the panoramic image is shown in fig. 4A.
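A minimal sketch of the per-pixel mapping of step 302, assuming the centre-origin equirectangular convention used in formulas (1) and (2) above; the function names, grid resolution and nearest-neighbour sampling are illustrative choices for this sketch.

```python
import numpy as np

def latlon_to_pano(lam, phi, W, H):
    """Map a viewing-sphere longitude/latitude (lam in [0, 2*pi], phi in [-pi/2, pi/2])
    to centre-origin panoramic coordinates (u_p, v_p), as in formula (2)."""
    u_p = W * (lam - np.pi) / (2.0 * np.pi)
    v_p = H * phi / np.pi
    return u_p, v_p

def build_sphere_image(pano, n_lat=512, n_lon=1024):
    """Resample the panorama onto a regular longitude/latitude grid
    (nearest-neighbour sampling for brevity)."""
    H, W = pano.shape[:2]
    sphere = np.zeros((n_lat, n_lon) + pano.shape[2:], dtype=pano.dtype)
    for i in range(n_lat):
        phi = -np.pi / 2.0 + np.pi * i / (n_lat - 1)
        for j in range(n_lon):
            lam = 2.0 * np.pi * j / n_lon
            u_p, v_p = latlon_to_pano(lam, phi, W, H)
            # shift from centre-origin coordinates to array indices
            x = int(round(u_p + W / 2.0)) % W
            y = min(max(int(round(v_p + H / 2.0)), 0), H - 1)
            sphere[i, j] = pano[y, x]
    return sphere
```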
Step 303, generating a first pinhole image corresponding to the first virtual camera based on the spherical view image. The spherical view image is the image in the spherical view coordinate system, the position of the first virtual camera is the sphere center position of the spherical view coordinate system, and the initial pose of the first virtual camera is any pose centered on the sphere center position.
For example, the view spherical image may be expanded into pinhole images from multiple viewpoints, each pinhole image corresponds to a virtual camera, the virtual camera does not exist in a real scene, and is a camera virtualized at the center of sphere position of the view spherical coordinate system, that is, the position of the virtual camera coincides with the center of sphere position of the view spherical coordinate system.
The spherical view coordinate system may be a three-dimensional coordinate system with an X axis, a Y axis and a Z axis, and for each virtual camera, the pose of the virtual camera may be any pose centered on the sphere center position. For example, a virtual camera has three directions: in one pose, its first direction coincides with the X axis of the spherical view coordinate system (i.e., rotated 0 degrees around the X axis), its second direction coincides with the Y axis (i.e., rotated 0 degrees around the Y axis), and its third direction coincides with the Z axis (i.e., rotated 0 degrees around the Z axis). In another pose, the first direction is rotated 60 degrees around the X axis, the second direction coincides with the Y axis, and the third direction coincides with the Z axis. In yet another pose, the first direction is rotated 120 degrees around the X axis, the second direction is rotated 60 degrees around the Y axis, and the third direction coincides with the Z axis. In summary, the pose of a virtual camera can be obtained by rotating around the coordinate axes (the X axis, the Y axis and the Z axis) of the spherical view coordinate system, and the rotation angles are not limited.
For example, the virtual camera in any posture may be referred to as a first virtual camera, and the posture of the first virtual camera may be referred to as an initial posture, and it is obvious that the position of the first virtual camera is a sphere center position of a view spherical coordinate system, and the initial posture of the first virtual camera is any posture centered on the sphere center position.
In this embodiment, it is taken as an example that the first direction of the initial posture coincides with the X axis of the spherical viewing surface coordinate system, the second direction of the initial posture coincides with the Y axis of the spherical viewing surface coordinate system, and the third direction of the initial posture coincides with the Z axis of the spherical viewing surface coordinate system, and the above is only an example of the initial posture, and is not limited thereto.
Referring to fig. 4B, a schematic diagram of the relationship between the rectangular coordinates of the first pinhole image (i.e., the pinhole image corresponding to the first virtual camera) and the longitude and latitude coordinates of the spherical view image is shown; the left side is the schematic diagram of the spherical view image, the right side is the schematic diagram of the first pinhole image, and the spherical view image may be regarded as an image on a unit sphere.
Referring to fig. 4B, a pinhole camera, i.e., the first virtual camera, is virtualized at the sphere center of the spherical view coordinate system, and the image corresponding to the first virtual camera is the first pinhole image in the view plane. Assuming that the width of the first pinhole image is w and its height is h, the coordinate of the center point of the first pinhole image is (u_0, v_0) = (w/2, h/2). Suppose the rectangular coordinate of a point Q on the first pinhole image is (u_q, v_q), the line connecting point Q and the sphere center position O of the spherical view coordinate system intersects the viewing sphere at point Q_s, and the distance between the sphere center position O and the center point of the first pinhole image is d. Based on this, the rectangular coordinate of point Q on the first pinhole image is converted into a three-dimensional rectangular coordinate with the unit sphere center as the origin, see formula (3):

Q = (u_q - u_0, v_q - v_0, d)    Formula (3)

As can be seen from fig. 4B, the three-dimensional coordinates of point Q_s and point Q differ by a scale factor that follows from the similarity of triangles, and this ratio relationship can be seen in formula (4):

|OQ_s| / |OQ| = 1 / √((u_q - u_0)² + (v_q - v_0)² + d²)    Formula (4)

Since |OQ_s| = 1 and |OQ| = √((u_q - u_0)² + (v_q - v_0)² + d²), the coordinates of point Q_s can be seen in formula (5):

Q_s = (u_q - u_0, v_q - v_0, d) / √((u_q - u_0)² + (v_q - v_0)² + d²)    Formula (5)
In practical applications, the point coordinates on the viewing sphere can also be expressed by longitude and latitude, as shown in formula (6):

Q_s = (x_s, y_s, z_s) = (cos φ · cos λ, cos φ · sin λ, sin φ)    Formula (6)

By combining formula (3), formula (5) and formula (6), the relationship between the rectangular coordinates of the first pinhole image and the longitude and latitude coordinates of the spherical view image can be obtained, as shown in formula (7), which is one example of this relationship:

λ = arctan2(v - v_0, u - u_0), φ = arcsin(d / √((u - u_0)² + (v - v_0)² + d²))    Formula (7)
In the formula (7), λ and
Figure BDA0003355107050000114
respectively representing the longitude and latitude coordinates, u, of the spherical-view image0=w/2,v0W denotes the width of the first pinhole image, h denotes the height of the first pinhole image, and w and h are both known values. d represents the distance between the sphere center position O of the spherical coordinate system and the center point of the first small hole image, is a known value, can be configured according to experience, and can also be calculated by adopting a certain algorithm to obtain the value of d, which is not limited. Based on the above formula (7), the rectangular coordinates on the first pinhole image and the longitude and latitude coordinates of the spherical-view image can be projected, i.e. the spherical-view image is converted into the first pinhole image.
In summary, based on the width w and the height h of the first pinhole image, the center point coordinate (u) of the first pinhole image can be determined0,v0) Based on the coordinates of the center point and the target distance d, the rectangular coordinates (u, v) in the first pinhole image and the longitude and latitude coordinates in the spherical-view image can be determined
Figure BDA0003355107050000115
The mapping relationship can be shown in formula (7). Obviously, for each rectangular coordinate in the first pinhole image, the longitude and latitude coordinate corresponding to the rectangular coordinate may be determined from the spherical-view image based on the mapping relationship, and the pixel value of the rectangular coordinate may be determined based on the pixel value of the longitude and latitude coordinate, that is, the pixel value of the longitude and latitude coordinate is taken as the pixel value of the rectangular coordinate. On the basis, the pixel values of each rectangular coordinate in the first small hole image can be combined into the first small hole image, so that a first small hole image is obtained.
In a possible implementation manner, after the panoramic image of the target scene is obtained, the panoramic image may also be converted directly into the first pinhole image corresponding to the first virtual camera, instead of first converting the panoramic image into the spherical view image corresponding to the spherical view coordinate system and then generating the first pinhole image based on the spherical view image. For example, the relationship between the rectangular coordinates of the first pinhole image and the rectangular coordinates of the panoramic image can be obtained by composing formula (2) with formula (7) (whose derivation uses formula (3), formula (5) and formula (6)), as shown in formula (8); the meaning of each symbol in formula (8) is the same as above and is not repeated here:

u_p = (W/2π)·(λ - π), v_p = (H/π)·φ, where λ = arctan2(v - v_0, u - u_0) and φ = arcsin(d / √((u - u_0)² + (v - v_0)² + d²))    Formula (8)

As can be seen from formula (8), for each rectangular coordinate (u, v) in the first pinhole image, the rectangular coordinate (u_p, v_p) corresponding to (u, v) can be determined from the panoramic image based on this mapping relationship, and the pixel value of (u, v) can be determined based on the pixel value of (u_p, v_p), i.e., the pixel value of (u_p, v_p) is taken as the pixel value of (u, v). On this basis, the pixel values of all rectangular coordinates in the first pinhole image are combined into the first pinhole image, so that the first pinhole image is obtained.
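The direct conversion described above can be sketched as follows, combining the pinhole-to-sphere lifting of formulas (3) to (7) with the sphere-to-panorama mapping of formula (2). The coordinate conventions follow the reconstructions above and are assumptions of this sketch; nearest-neighbour sampling is used for brevity (see the note on bilinear interpolation later).

```python
import numpy as np

def render_pinhole_from_pano(pano, w, h, d):
    """Render the first virtual camera's pinhole image of size (h, w) directly
    from an equirectangular panorama, as in formula (8)."""
    H, W = pano.shape[:2]
    out = np.zeros((h, w) + pano.shape[2:], dtype=pano.dtype)
    u0, v0 = w / 2.0, h / 2.0
    for v in range(h):
        for u in range(w):
            # lift the pinhole pixel onto the unit viewing sphere (formulas (3), (5))
            x, y, z = u - u0, v - v0, d
            rho = np.sqrt(x * x + y * y + z * z)
            lam = np.arctan2(y, x) % (2.0 * np.pi)   # longitude, formula (7)
            phi = np.arcsin(z / rho)                 # latitude, formula (7)
            # sphere -> centre-origin panorama coordinates -> array indices (formula (2))
            up = W * (lam - np.pi) / (2.0 * np.pi) + W / 2.0
            vp = H * phi / np.pi + H / 2.0
            out[v, u] = pano[min(int(round(vp)), H - 1), int(round(up)) % W]
    return out
```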
Step 304, determining a rotation matrix between a target pose of the second virtual camera and an initial pose of the first virtual camera, for example, the position of the second virtual camera may be a center of sphere position of the spherical coordinate system, and the target pose is obtained by rotating the initial pose around coordinate axes of the spherical coordinate system.
For example, the view spherical image may be expanded into pinhole images at a plurality of viewpoints, each pinhole image corresponds to a virtual camera, and the virtual camera is a camera which is virtual at the spherical center position of the view spherical coordinate system, that is, the position of the virtual camera coincides with the spherical center position of the view spherical coordinate system. The pose of the virtual camera can be obtained by rotating around the coordinate axes (such as the X axis, the Y axis and the Z axis) of the spherical coordinate system, and the rotating angle is not limited.
On the basis of knowing the initial posture of the first virtual camera, the target posture of the second virtual camera is obtained by rotating the initial posture around the coordinate axis of the visual spherical coordinate system, for example, the first direction of the target posture is obtained by rotating the initial posture by 60 degrees around the X axis, the second direction of the target posture is obtained by rotating the initial posture by 60 degrees around the Y axis, and the third direction of the target posture is obtained by rotating the initial posture by 0 degree around the Z axis. Assuming that the first direction of the initial pose coincides with the X-axis, the second direction of the initial pose coincides with the Y-axis, and the third direction of the initial pose coincides with the Z-axis, then the first direction of the target pose is rotated 60 degrees about the X-axis, the second direction of the target pose is rotated 60 degrees about the Y-axis, and the third direction of the target pose coincides with the Z-axis. For another example, the first direction of the target pose is obtained by rotating the initial pose by 120 degrees around the X-axis, the second direction of the target pose is obtained by rotating the initial pose by 90 degrees around the Y-axis, the third direction of the target pose is obtained by rotating the initial pose by 0 degrees around the Z-axis, and so on, and the rotation relationship between the target pose and the initial pose is not limited.
In summary, based on the target pose of the second virtual camera and the initial pose of the first virtual camera, a first rotation angle between the target pose and the initial pose in the first coordinate axis direction (i.e., the rotation angle of the initial pose around the X axis), denoted A_X, can be determined; a second rotation angle between the target pose and the initial pose in the second coordinate axis direction (i.e., the rotation angle of the initial pose around the Y axis), denoted A_Y, can be determined; and a third rotation angle between the target pose and the initial pose in the third coordinate axis direction (i.e., the rotation angle of the initial pose around the Z axis), denoted A_Z, can be determined.
Based on the first rotation angle A_X, the second rotation angle A_Y and the third rotation angle A_Z, the rotation matrix between the target pose of the second virtual camera and the initial pose of the first virtual camera can be determined. For example, a first sub-rotation matrix R_x in the first coordinate axis direction is determined based on the first rotation angle A_X, a second sub-rotation matrix R_y in the second coordinate axis direction is determined based on the second rotation angle A_Y, and a third sub-rotation matrix R_z in the third coordinate axis direction is determined based on the third rotation angle A_Z. Then, based on the first sub-rotation matrix R_x, the second sub-rotation matrix R_y and the third sub-rotation matrix R_z, the rotation matrix between the target pose and the initial pose can be determined.
Since rotation around the Z axis does not conform to the shooting habit, the rotation angle of the initial pose around the Z axis is usually 0 degrees, and the third sub-rotation matrix R_z is usually the identity matrix. Therefore R_z is not discussed further, and the rotation matrix is determined based on the first sub-rotation matrix R_x and the second sub-rotation matrix R_y as an example.
See formula (9) for determining the first sub-rotation matrix R_x in the first coordinate axis direction based on the first rotation angle A_X, and formula (10) for determining the second sub-rotation matrix R_y in the second coordinate axis direction based on the second rotation angle A_Y; based on the first sub-rotation matrix R_x and the second sub-rotation matrix R_y, the rotation matrix R can be obtained as shown in formula (11).

R_x = [ 1, 0, 0; 0, cos A_X, -sin A_X; 0, sin A_X, cos A_X ]    Formula (9)

R_y = [ cos A_Y, 0, sin A_Y; 0, 1, 0; -sin A_Y, 0, cos A_Y ]    Formula (10)

R = R_y · R_x    Formula (11)

In summary, from the first rotation angle A_X and the second rotation angle A_Y between the target pose of the second virtual camera and the initial pose of the first virtual camera, the first sub-rotation matrix R_x and the second sub-rotation matrix R_y can be determined, and then the rotation matrix R between the target pose and the initial pose is obtained.
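A small sketch of formulas (9) to (11), using the standard elemental rotation matrices about the X and Y axes (the exact sign convention of the original formulas is assumed); the Z-axis rotation is taken as the identity, as in the text.

```python
import numpy as np

def rig_rotation(ax, ay):
    """Rotation matrix R between a second virtual camera's target pose and the
    first virtual camera's initial pose, composed as R = Ry(ay) @ Rx(ax)."""
    Rx = np.array([[1.0, 0.0, 0.0],
                   [0.0, np.cos(ax), -np.sin(ax)],
                   [0.0, np.sin(ax),  np.cos(ax)]])   # formula (9)
    Ry = np.array([[ np.cos(ay), 0.0, np.sin(ay)],
                   [ 0.0,        1.0, 0.0       ],
                   [-np.sin(ay), 0.0, np.cos(ay)]])   # formula (10)
    return Ry @ Rx                                     # formula (11)

# e.g. six viewpoints spaced 60 degrees apart around the Y axis (A_X = 0),
# as in the example discussed next; index 0 corresponds to the first virtual camera
rig_rotations = [rig_rotation(0.0, k * np.pi / 3.0) for k in range(6)]
```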
For example, there may be at least one second virtual camera. For each second virtual camera, the target pose corresponding to that second virtual camera is known, and the rotation matrix R between the target pose of that second virtual camera and the initial pose of the first virtual camera can then be obtained. For example, referring to fig. 4C, which is a schematic diagram of the relationship between the pinhole image coordinates and the viewing sphere coordinates, pinhole camera viewpoints are virtualized at equal angular intervals around the Y axis at the center of the spherical view coordinate system: the second rotation angle A_Y takes the values {0, π/3, 2π/3, π, 4π/3, 5π/3}, and the first rotation angle A_X is 0. On this basis, 6 virtual cameras can be obtained. The virtual camera whose first rotation angle A_X is 0 and whose second rotation angle A_Y is 0 is the first virtual camera. The virtual camera whose first rotation angle A_X is 0 and whose second rotation angle A_Y is π/3 is denoted the second virtual camera 1, and the rotation matrix R between the target pose of the second virtual camera 1 and the initial pose of the first virtual camera is determined based on formula (9) to formula (11). By analogy, the virtual camera whose first rotation angle A_X is 0 and whose second rotation angle A_Y is 5π/3 is denoted the second virtual camera 5, and the rotation matrix R between the target pose of the second virtual camera 5 and the initial pose of the first virtual camera is determined based on formula (9) to formula (11).
In summary, when there are multiple second virtual cameras, the target poses of different second virtual cameras may be different, and the rotation matrix R corresponding to each second virtual camera may be determined.
Step 305, determining an external parameter matrix between the target pose and the initial pose based on the rotation matrix between the target pose of the second virtual camera and the initial pose of the first virtual camera. For example, a translation matrix between the first virtual camera and the second virtual camera may be determined, and then the external parameter matrix between the target pose and the initial pose may be determined based on the rotation matrix and the translation matrix.
For example, since the second virtual camera and the first virtual camera are fixedly connected and optically concentric, that is, the position of the second virtual camera is the sphere center position of the spherical view coordinate system and the position of the first virtual camera is also the sphere center position of the spherical view coordinate system, the translation matrix between the first virtual camera and the second virtual camera may be t = (0, 0, 0)^T, i.e., no translation occurs between the positions of the first virtual camera and the second virtual camera.
For example, after the rotation matrix and the translation matrix are obtained, the external parameter matrix can be obtained based on the rotation matrix and the translation matrix, as shown in formula (12), which is one example of determining the external parameter matrix:

T_{c_i c_0} = [ R_{c_i c_0}, t_{c_i c_0}; 0, 1 ]    Formula (12)

In formula (12), c_0 denotes the first virtual camera (i.e., the reference camera), c_i denotes the i-th second virtual camera, R_{c_i c_0} denotes the rotation matrix between the i-th second virtual camera and the first virtual camera, as shown in formula (9) to formula (11), t_{c_i c_0} denotes the translation matrix between the i-th second virtual camera and the first virtual camera, i.e., t_{c_i c_0} = (0, 0, 0)^T, and T_{c_i c_0} denotes the external parameter matrix between the i-th second virtual camera and the first virtual camera. The external parameter matrix T_{c_i c_0} will be used in the subsequent multi-camera binding optimization process, see the subsequent embodiments.
In summary, when there are a plurality of second virtual cameras, an external parameter matrix between each second virtual camera and the first virtual camera can be determined, and the external parameter matrix includes the rotation matrix and the translation matrix.
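Given a rotation from formula (11), the corresponding external parameter matrix of formula (12) is a pure rotation with zero translation, since the virtual cameras all sit at the sphere center. A minimal sketch (function name is illustrative):

```python
import numpy as np

def rig_extrinsic(R):
    """4x4 external parameter matrix T between the i-th second virtual camera
    and the first virtual camera: rotation R, translation (0, 0, 0)."""
    T = np.eye(4)
    T[:3, :3] = R
    # translation stays zero: the virtual cameras are fixedly connected and optically concentric
    return T
```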
Step 306, determining a second pinhole image corresponding to the second virtual camera based on the rotation matrix between the target pose of the second virtual camera and the initial pose of the first virtual camera. For example, the second pinhole image is determined based on the rotation matrix and the first pinhole image, or the second pinhole image is determined based on the rotation matrix and the view sphere image. Of course, the above manner is only an example, and the present embodiment does not limit this.
For example, referring to fig. 4B, a point Q_s on the spherical view image may be rotated based on the rotation matrix R between the target pose of the second virtual camera and the initial pose of the first virtual camera to obtain the rotated point Q_s', and the relationship between the rotated point Q_s' and the point Q_s before rotation can be seen in formula (13):

Q_s' = R · Q_s    Formula (13)

Obviously, the point Q_s on the spherical view image corresponds to the point Q on the first pinhole image, and the point Q_s' on the spherical view image corresponds to the point Q' on the second pinhole image. Based on this, the relationship between the point Q' on the second pinhole image and the point Q on the first pinhole image can be seen in formula (14):

(u' - u_0, v' - v_0, d)^T = (d / z_r) · R · (u - u_0, v - v_0, d)^T, where z_r denotes the third component of R · (u - u_0, v - v_0, d)^T    Formula (14)
in summary, it can be seen that, based on the rotation matrix R, a mapping relationship between the coordinates on the first pinhole image and the coordinates on the second pinhole image can be determined, the mapping relationship can be shown in formula (14), and based on the mapping relationship, the first pinhole image can be converted into the second pinhole image, which is not described herein again.
In another possible implementation, the formula (13) may be substituted into the formula (6) and the formula (7) to obtain a mapping relationship between the coordinates on the spherical viewing surface image and the coordinates on the second pinhole image, and based on the mapping relationship, the spherical viewing surface image may be converted into the second pinhole image, which is not described herein again.
For example, when generating pixels of a pinhole image (such as a first pinhole image or a second pinhole image), interpolation calculation may also be performed, for example, coordinates of the panoramic image may be floating point values, and interpolation calculation may be performed in a bilinear interpolation manner to obtain pixels of the pinhole image, which is not limited in this process.
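The conversion of the first pinhole image into a second pinhole image (formulas (13) and (14)) together with the bilinear interpolation mentioned above can be sketched as follows. The backward-warping formulation and the handling of pixels that fall outside the first image are implementation choices of this sketch; in practice those pixels can be filled from the spherical view image or the panorama instead.

```python
import numpy as np

def warp_to_second_pinhole(first_img, R, d):
    """Fill the second virtual camera's pinhole image by lifting each of its pixels
    onto the viewing sphere, rotating back by R^T into the first camera (formula (13)),
    projecting onto the first image plane (formula (14)), and sampling bilinearly."""
    h, w = first_img.shape[:2]
    src = first_img.astype(np.float64)
    out = np.zeros_like(src)
    u0, v0 = w / 2.0, h / 2.0
    for v in range(h):
        for u in range(w):
            q = np.array([u - u0, v - v0, d])
            p = R.T @ q                    # direction of this pixel in the first camera frame
            if p[2] <= 0:
                continue                   # behind the first image plane, not visible there
            s = d / p[2]                   # rescale the ray so it meets the plane z = d
            x, y = p[0] * s + u0, p[1] * s + v0
            x0, y0 = int(np.floor(x)), int(np.floor(y))
            if 0 <= x0 < w - 1 and 0 <= y0 < h - 1:
                dx, dy = x - x0, y - y0    # bilinear interpolation weights
                out[v, u] = ((1 - dx) * (1 - dy) * src[y0, x0] +
                             dx * (1 - dy) * src[y0, x0 + 1] +
                             (1 - dx) * dy * src[y0 + 1, x0] +
                             dx * dy * src[y0 + 1, x0 + 1])
    return out
```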
In summary, the external reference matrix between the first pinhole image and the second pinhole image and between the virtual cameras can be obtained. For example, assuming that there are a first virtual camera, a second virtual camera 1, and a second virtual camera 2, a first pinhole image corresponding to the first virtual camera, a second pinhole image 1 corresponding to the second virtual camera 1, and a second pinhole image 2 corresponding to the second virtual camera 2 are obtained, and an external parameter matrix 11 between the second virtual camera 1 and the first virtual camera, and an external parameter matrix 21 between the second virtual camera 2 and the first virtual camera are obtained.
Secondly, multi-camera binding mapping is performed: for example, based on the first pinhole image, the second pinhole image and the fixed connection constraint (i.e., the external parameter matrix), multi-camera bound SFM reconstruction is performed to obtain the three-dimensional visual map.
Illustratively, for the first pinhole image and the second pinhole images (there is one first pinhole image and at least one second pinhole image), two-dimensional feature points corresponding to an actual position of the target scene may be selected from the first pinhole image and the second pinhole images. In practical applications, multiple frames of panoramic images at different times may be obtained, each frame of panoramic image corresponds to a first pinhole image and second pinhole images, and a plurality of two-dimensional feature points corresponding to the actual position may be selected from these first and second pinhole images; that is, the two-dimensional feature points are feature points in all the pinhole images at different times. For example, for a certain actual position (i.e., an actual physical position) of the target scene, a two-dimensional feature point corresponding to the actual position may be selected from the first pinhole image, and a two-dimensional feature point corresponding to the actual position may be selected from each second pinhole image, so as to obtain a plurality of two-dimensional feature points. The three-dimensional map point corresponding to the actual position is then determined based on the two-dimensional feature points and the external parameter matrix. For example, a target loss value of the configured loss function is determined; a projection function value between the coordinate system of the virtual camera and the coordinate system of the pinhole image is determined based on the target loss value and the two-dimensional feature points corresponding to the actual position; and the three-dimensional map point corresponding to the actual position is determined based on the external parameter matrix and the projection function value, i.e., it serves as a three-dimensional map point in the three-dimensional visual map.
For example, in the multi-camera bundled SFM reconstruction process, the fixed-connection constraint between cameras (i.e., the constraint that the cameras in a group are related by known external parameter matrices) is taken into account, and this fixed-connection constraint between the virtual cameras can be added to the optimization. The multi-camera bundle optimization mainly adds a rigid-body connection constraint between the cameras during the reconstruction. Each camera fixed-connection set is composed of a plurality of snapshots sharing the same camera fixed-connection constraint, and each snapshot is composed of the images taken by all cameras in the camera group at the same time. The external parameters between the cameras in the group are the external parameter matrices $T_{c_0 c_i}$ obtained in the above embodiment, and this fixed-connection constraint is added to the subsequent reconstruction process.
For example, the reprojection error of a fixed-connection set can be expressed as shown in equation (15):

$$E_{S_k}=\sum_{c_i\in S_k}\sum_{j}\rho_j\left(\left\|\pi\left(\left(T_{wc}\,T_{c_0 c_i}\right)^{-1}X_k\right)-x_j\right\|^2\right)\qquad(15)$$

In equation (15), $S_k$ represents a fixed-connection set; assuming there are 5 second virtual cameras, $S_k$ represents the fixed-connection group formed by these 5 second virtual cameras, and the value range of $i$ is 1 to 5, that is, $c_1$ represents the 1st second virtual camera, $c_2$ represents the 2nd second virtual camera, and so on. $j$ indexes the two-dimensional feature points corresponding to the actual position (the two-dimensional feature points determined from all pinhole images): when $j$ is 1 it refers to the first two-dimensional feature point corresponding to the actual position, when $j$ is 2 it refers to the second two-dimensional feature point corresponding to the actual position, and so on. $\rho_j$ represents the robust kernel function of the $j$th two-dimensional feature point, which suppresses large outlier errors. $x_j$ represents the 2D coordinates of the $j$th two-dimensional feature point in its pinhole image. $\pi$ denotes the projection function between the coordinate system of the virtual camera and the coordinate system of the pinhole image, i.e., the projection function from the camera system to the image system. $T_{c_0 c_i}$ denotes the external parameter matrix within the fixed-connection group, where $c_0$ represents the first virtual camera and $c_i$ represents the $i$th second virtual camera; that is, $T_{c_0 c_i}$ is the external parameter matrix between the $i$th second virtual camera and the first virtual camera (when $i$ is 1, the external parameter matrix between the 1st second virtual camera and the first virtual camera; when $i$ is 2, the external parameter matrix between the 2nd second virtual camera and the first virtual camera; and so on), and these external parameter matrices have all been obtained above. $T_{wc}$ represents the pose matrix of the reference camera in the fixed-connection group, usually camera No. 0, namely the pose matrix of the first virtual camera. $X_k$ represents the three-dimensional map point corresponding to the actual position, i.e., the 3D scene point to be obtained. For any camera in the fixed-connection group, its pose can be obtained from $T_{wc}$ and $T_{c_0 c_i}$ as $T_{wc_i}=T_{wc}\,T_{c_0 c_i}$.
In equation (15), the parameters to be optimized include the pose matrix of the reference camera (i.e., the pose matrix $T_{wc}$ of the first virtual camera) and the three-dimensional map point $X_k$, while the external parameter matrix $T_{c_0 c_i}$ and the feature points $x_j$ are known. Therefore, during the multi-camera bundled SFM reconstruction process, the above loss function can be minimized, for example, by the LM (Levenberg-Marquardt) algorithm, and the pose matrix $T_{wc}$ of the reference camera and the three-dimensional map point $X_k$ are finally obtained; that is, the three-dimensional map point $X_k$ corresponding to the actual position can be obtained.
In summary, the reprojection error function shown in equation (15) can be used as the configured loss function, the loss function can be minimized by the LM algorithm, and the minimized value can be used as the target loss value of the loss function. As can be seen from equation (15), based on the target loss value and the two-dimensional feature points $x_j$ corresponding to the actual position, the value of the projection function between the coordinate system of the virtual camera and the coordinate system of the pinhole image, i.e., $\pi\left(\left(T_{wc}\,T_{c_0 c_i}\right)^{-1}X_k\right)$, can be determined. Then, based on the external parameter matrix $T_{c_0 c_i}$ and this projection function value, the pose matrix $T_{wc}$ of the reference camera and the three-dimensional map point $X_k$ can be deduced; that is, the three-dimensional map point $X_k$ corresponding to the actual position can be obtained and used as a three-dimensional map point in the three-dimensional visual map.
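As a rough sketch of how the reprojection error of equation (15) could be evaluated for one fixed-connection set, the code below assumes 4x4 homogeneous pose matrices, a shared pinhole intrinsic matrix K standing in for the projection function, and a Huber kernel standing in for the robust function; an actual implementation would minimize the sum of these errors over the reference pose and the map points with an LM solver.

```python
import numpy as np

def project(K: np.ndarray, p_cam: np.ndarray) -> np.ndarray:
    """Projection function pi: camera coordinates -> pinhole image coordinates."""
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]

def huber(e2: float, delta: float = 2.0) -> float:
    """Robust kernel rho_j used to suppress large outlier errors."""
    e = np.sqrt(e2)
    return e2 if e <= delta else 2.0 * delta * e - delta ** 2

def rig_reprojection_error(T_wc, rig_extrinsics, K, X_k, observations):
    """Reprojection error of one fixed-connection set in the style of equation (15).

    T_wc           : 4x4 pose matrix of the reference (first virtual) camera.
    rig_extrinsics : list of 4x4 matrices T_{c0 ci} between camera i and camera 0.
    X_k            : 3D map point in world coordinates, shape (3,).
    observations   : list of (i, x_j) pairs, where x_j is the observed 2D feature
                     point of the map point in the pinhole image of camera i.
    """
    X_h = np.append(X_k, 1.0)                      # homogeneous world point
    total = 0.0
    for i, x_j in observations:
        T_wci = T_wc @ rig_extrinsics[i]           # camera i pose from T_wc and T_{c0 ci}
        p_cam = (np.linalg.inv(T_wci) @ X_h)[:3]   # map point in camera i coordinates
        residual = project(K, p_cam) - x_j
        total += huber(residual @ residual)
    return total
```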
For a plurality of actual positions of the target scene, the three-dimensional map points corresponding to each actual position may be obtained in the above manner, so that the three-dimensional visual map of the target scene is constructed based on the plurality of three-dimensional map points of the target scene, that is, the three-dimensional visual map may include the plurality of three-dimensional map points of the target scene.
Thirdly, the map information corresponding to the three-dimensional visual map is stored, for example, in a visual feature database. The map information corresponding to the three-dimensional visual map may include, but is not limited to: a sample global descriptor corresponding to a sample image, three-dimensional map points corresponding to the sample image, and sample local descriptors corresponding to the three-dimensional map points, wherein the sample image is a pinhole image selected from the first pinhole image and the second pinhole image; for example, all pinhole images may be used as sample images, or a part of the pinhole images may be selected from all pinhole images as sample images, which is not limited.
After obtaining the three-dimensional visual map of the target scene, the three-dimensional visual map may include the following information:
pose of sample image: the sample image is an image when the three-dimensional visual map is constructed, that is, the first pinhole image, the second pinhole image, and the like, that is, the three-dimensional visual map can be constructed based on the sample image, and the pose matrix (which may be referred to as sample image pose for short) of the sample image can be stored in the three-dimensional visual map, that is, the three-dimensional visual map can include the pose of the sample image. Referring to the above embodiments, the pose of the sample image may be the pose matrix T of the reference camera corresponding to the first pinhole image corresponding to the sample imagewc
Sample global descriptor: for each frame of sample image, the sample image may correspond to an image global descriptor, and the image global descriptor is denoted as a sample global descriptor, where the sample global descriptor represents the sample image by using a high-dimensional vector, and the sample global descriptor is used to distinguish image features of different sample images.
For each frame of sample image, a bag-of-words vector corresponding to the sample image may be determined based on the trained dictionary model, and the bag-of-words vector may be determined as the sample global descriptor corresponding to the sample image. For example, the bag-of-words (Bag of Words) method is one way of determining a global descriptor: a bag-of-words vector can be constructed, which is a vector representation used for image similarity detection, and this bag-of-words vector can be used as the sample global descriptor corresponding to the sample image.
In the visual bag-of-words method, a "dictionary", which may also be referred to as a dictionary model, needs to be trained in advance, and a classification tree is obtained by clustering feature point descriptors in a large number of images and training, wherein each classification tree can represent a visual "word", and the visual "words" form the dictionary model.
For a sample image, all feature point descriptors in the sample image may be classified as words, and the occurrence frequency of all words is counted, so that the frequency of each word in a dictionary may form a vector, the vector is a bag-of-word vector corresponding to the sample image, the bag-of-word vector may be used to measure the similarity of two images, and the bag-of-word vector is used as a sample global descriptor corresponding to the sample image.
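A minimal sketch of forming such a bag-of-words vector from an image's local descriptors is shown below; for simplicity it assumes the trained dictionary has been flattened into an array of visual-word centroids rather than the classification tree described above, and the names and normalization are illustrative.

```python
import numpy as np

def bag_of_words_vector(descriptors: np.ndarray, vocabulary: np.ndarray) -> np.ndarray:
    """Build a bag-of-words vector by counting how often each visual word occurs.

    descriptors : (N, D) local descriptors extracted from one image.
    vocabulary  : (V, D) visual-word centroids (flattened dictionary model).
    """
    # Assign every descriptor to its nearest visual word.
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = np.argmin(dists, axis=1)
    # Word-frequency histogram, normalized so images with different numbers
    # of feature points remain comparable.
    bow = np.bincount(words, minlength=len(vocabulary)).astype(np.float64)
    return bow / (np.linalg.norm(bow) + 1e-12)
```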
For each frame of sample image, the sample image may be input to a trained deep learning model to obtain a target vector corresponding to the sample image, and the target vector is determined as a sample global descriptor corresponding to the sample image. For example, a deep learning method is a method for determining a global descriptor, in the deep learning method, a sample image may be subjected to multilayer convolution through a deep learning model, and a high-dimensional target vector is finally obtained, and the target vector is used as the sample global descriptor corresponding to the sample image.
In the deep learning method, a deep learning model, such as a CNN (Convolutional Neural Networks) model, needs to be trained in advance, and the deep learning model is generally obtained by training a large number of images, and the training mode of the deep learning model is not limited. For a sample image, the sample image may be input to a deep learning model, the deep learning model processes the sample image to obtain a high-dimensional target vector, and the target vector is used as a sample global descriptor corresponding to the sample image.
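A minimal sketch of this deep-learning alternative is given below, using a generic torchvision ResNet backbone with global average pooling as a stand-in for the trained deep learning model; the specific network, weights and preprocessing are assumptions, not the model actually used.

```python
import torch
import torchvision.models as models

# Backbone up to the global-average-pooling layer; the classification head is dropped.
# The `weights` argument follows recent torchvision versions.
backbone = models.resnet18(weights=None)
encoder = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

def global_descriptor(image: torch.Tensor) -> torch.Tensor:
    """Map a (3, H, W) image tensor to a high-dimensional global descriptor."""
    with torch.no_grad():
        feat = encoder(image.unsqueeze(0))   # (1, 512, 1, 1) for resnet18
        vec = feat.flatten(1)                # (1, 512) high-dimensional vector
        return torch.nn.functional.normalize(vec, dim=1).squeeze(0)
```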
Sample local descriptors corresponding to feature points of the sample image: for a sample image, the sample image may include a plurality of feature points, where a feature point is a specific pixel position in the sample image, the feature point may correspond to an image local descriptor, and the image local descriptor is recorded as a sample local descriptor, and the sample local descriptor describes features of image blocks in a range near the feature point (i.e., a pixel position) with a vector, and the vector may also be referred to as a descriptor of the feature point. In summary, the sample local descriptor is a feature vector for representing an image block where the feature point is located, and the image block may be located in the sample image. For each feature point of the sample image, the feature point corresponds to a three-dimensional map point in the three-dimensional visual map, that is, a sample local descriptor corresponding to the feature point may be referred to as a sample local descriptor corresponding to the three-dimensional map point.
Algorithms such as ORB (Oriented FAST and Rotated BRIEF), SIFT (Scale-Invariant Feature Transform) and SURF (Speeded Up Robust Features) can be adopted to extract feature points from the sample image and determine the sample local descriptors corresponding to the feature points. A deep learning algorithm (such as SuperPoint, DELF, D2-Net, etc.) may also be used to extract feature points from the sample image and determine the sample local descriptors corresponding to the feature points; this is not limited, as long as the feature points can be obtained and the sample local descriptors can be determined.
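For instance, with OpenCV the feature points and their local descriptors could be extracted as sketched below; the ORB parameters shown are illustrative, and SIFT, SURF or a learned detector could be substituted.

```python
import cv2

def extract_local_features(image_path: str, max_features: int = 2000):
    """Extract ORB feature points and their local descriptors from a sample image."""
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    orb = cv2.ORB_create(nfeatures=max_features)
    keypoints, descriptors = orb.detectAndCompute(image, None)
    # keypoints[i].pt is the pixel position of the i-th feature point;
    # descriptors[i] is its 32-byte (256-bit) binary local descriptor.
    return keypoints, descriptors
```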
Map point information of the three-dimensional map points: the map point information may include, but is not limited to, the 3D spatial position of each three-dimensional map point, all sample images in which the map point is observed, and the indices of the corresponding 2D feature points in those images.
Fourthly, an online positioning process is performed. For example, a target image of the target scene may be acquired, and pose calculation may be performed on the target image based on the three-dimensional visual map of the target scene to obtain the global positioning pose in the three-dimensional visual map corresponding to the target image, thereby completing the positioning process. For example, when the terminal device needs to be globally positioned based on the three-dimensional visual map, the terminal device may download the three-dimensional visual map from the server, and while the terminal device moves in the target scene, it may be positioned based on the three-dimensional visual map. The target scene may be an indoor environment, that is, when the terminal device moves in the indoor environment, the global positioning pose of the terminal device in the three-dimensional visual map may be determined; of course, the target scene may also be an outdoor environment or the like, and the target scene is not limited. In the above embodiments, a pose (e.g., the global positioning pose) may consist of a position and an orientation, generally represented by a rotation matrix and a translation vector, which is not limited.
The terminal device may include a vision sensor, such as a camera, for capturing a target image of the target scene during movement of the terminal device (i.e., a real-time image during movement).
The terminal device can be a wearable device (such as a video helmet, a smart watch, smart glasses and the like), and the visual sensor is deployed on the wearable device; or the terminal device is a recorder (for example, the terminal device is carried by a worker during work and has the functions of collecting video and audio in real time, taking pictures, recording, talkbacking, positioning and the like), and the visual sensor is arranged on the recorder; alternatively, the terminal device is a camera (such as a split camera), and the vision sensor is disposed on the camera. Alternatively, the terminal device is a robot, and the vision sensor is disposed on the robot. Alternatively, the terminal device is an autonomous vehicle and the vision sensor is disposed on the autonomous vehicle. Of course, the above are only a few examples, and the examples are not limited thereto, and for example, the terminal device may also be a smart phone, as long as the terminal device is deployed with the visual sensor.
For example, after the target image is obtained, a target three-dimensional map point corresponding to the target image may be determined from the three-dimensional visual map of the target scene, and the global positioning pose of the terminal device in the three-dimensional visual map may be determined based on the target three-dimensional map point.
For example, based on a three-dimensional visual map of a target scene, in one possible implementation, the following steps may be adopted to determine a global positioning pose of a terminal device in the three-dimensional visual map:
step S11, in the global positioning process of the terminal device, acquiring a target image of the terminal device in a target scene, for example, acquiring a target image in the target scene through a visual sensor, that is, a real-time video image.
And step S12, determining the global descriptor to be tested corresponding to the target image.
For example, the target image may correspond to an image global descriptor, and the image global descriptor may be denoted as a global descriptor to be detected, where the global descriptor to be detected represents the target image by using a high-dimensional vector, and the global descriptor to be detected is used to distinguish image features of different target images. For example, a bag-of-words vector corresponding to the target image is determined based on the trained dictionary model, and the bag-of-words vector is determined as the global descriptor to be detected corresponding to the target image. Or inputting the target image to the trained deep learning model to obtain a target vector corresponding to the target image, and determining the target vector as a to-be-detected global descriptor corresponding to the target image.
In summary, the global descriptor to be detected corresponding to the target image may be determined based on a visual bag-of-words method or a deep learning method, and the determination manner refers to the determination manner of the sample global descriptor, which is not described herein again.
Step S13, determining a distance between the global descriptor to be measured (i.e. the global descriptor to be measured corresponding to the target image) and the sample global descriptor corresponding to each frame of sample image corresponding to the three-dimensional visual map.
Referring to the above embodiments, the three-dimensional visual map may include a sample global descriptor corresponding to each frame of the sample image, and therefore, a distance, such as a euclidean distance, between the global descriptor to be measured and the sample global descriptor corresponding to each frame of the sample image may be determined, that is, the euclidean distance between two feature vectors is calculated.
Step S14, selecting candidate sample images from multi-frame sample images corresponding to the three-dimensional visual map based on the distance between the global descriptor to be detected and each sample global descriptor; the distance between the global descriptor to be tested and the sample global descriptor corresponding to the candidate sample image is the minimum distance; or, the distance between the global descriptor to be tested and the sample global descriptor corresponding to the candidate sample image is smaller than the distance threshold.
For example, assuming that the three-dimensional visual map corresponds to the sample image 1, the sample image 2, and the sample image 3, the distance 1 between the global descriptor to be measured and the sample global descriptor corresponding to the sample image 1 may be calculated, the distance 2 between the global descriptor to be measured and the sample global descriptor corresponding to the sample image 2 may be calculated, and the distance 3 between the global descriptor to be measured and the sample global descriptor corresponding to the sample image 3 may be calculated.
In one possible embodiment, if the distance 1 is the minimum distance, the sample image 1 is selected as the candidate sample image. Alternatively, if the distance 1 is smaller than the distance threshold (which may be configured empirically), and the distance 2 is smaller than the distance threshold, but the distance 3 is not smaller than the distance threshold, then both the sample image 1 and the sample image 2 are selected as candidate sample images. Or, if the distance 1 is the minimum distance and the distance 1 is smaller than the distance threshold, the sample image 1 is selected as the candidate sample image, but if the distance 1 is the minimum distance and the distance 1 is not smaller than the distance threshold, the candidate sample image cannot be selected, that is, the global positioning fails.
In summary, candidate sample images may be selected from the multi-frame sample images corresponding to the three-dimensional visual map.
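A minimal sketch of this candidate selection (steps S13-S14) is given below, assuming the global descriptors are stored as numpy arrays; both the minimum-distance rule and the distance-threshold rule described above are shown, and the function name is illustrative.

```python
import numpy as np

def select_candidate_images(query_desc, sample_descs, dist_threshold=None):
    """Select candidate sample images by global-descriptor distance.

    query_desc   : (D,) global descriptor to be tested of the target image.
    sample_descs : (M, D) sample global descriptors of the M sample images.
    Returns indices of candidate sample images (empty list means global positioning fails).
    """
    dists = np.linalg.norm(sample_descs - query_desc, axis=1)   # Euclidean distances
    if dist_threshold is None:
        return [int(np.argmin(dists))]                          # minimum-distance rule
    candidates = np.nonzero(dists < dist_threshold)[0]          # threshold rule
    return candidates.tolist()
```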
Step S15 is to acquire a plurality of feature points from the target image. For example, for each feature point, a local descriptor to be tested corresponding to the feature point may be determined, where the local descriptor to be tested may be used to represent a feature vector of an image block where the feature point is located, and the image block may be located in the target image.
For example, the target image may include a plurality of feature points, where a feature point may be a specific pixel position in the target image, the feature point may correspond to an image local descriptor, the image local descriptor is recorded as a local descriptor to be detected, the local descriptor to be detected describes the feature of an image block in a range near the feature point (i.e., the pixel position) with a vector, and the vector may also be referred to as a descriptor of the feature point. To sum up, the local descriptor to be measured is a feature vector for representing an image block where the feature point is located.
The characteristic points can be extracted from the target image by using algorithms such as ORB, SIFT, SURF and the like, and the local descriptors to be detected corresponding to the characteristic points are determined. A deep learning algorithm (such as SuperPoint, DELF, D2-Net, etc.) may also be used to extract feature points from the target image and determine the local descriptor to be detected corresponding to the feature points, which is not limited to this, as long as the feature points can be obtained and the local descriptor to be detected can be determined.
Step S16, for each feature point corresponding to the target image, determining a distance, such as an euclidean distance, between the local descriptor to be detected corresponding to the feature point and the sample local descriptor corresponding to each three-dimensional map point corresponding to the candidate sample image, that is, calculating the euclidean distance between two feature vectors.
Referring to the above embodiment, for each frame of sample image, the three-dimensional visual map includes the sample local descriptor corresponding to each three-dimensional map point corresponding to the sample image, and after the candidate sample image is obtained, the sample local descriptor corresponding to each three-dimensional map point corresponding to the candidate sample image may be obtained from the three-dimensional visual map.
After each feature point corresponding to the target image is obtained, the distance between the local descriptor to be detected corresponding to the feature point and the sample local descriptor corresponding to each three-dimensional map point corresponding to the candidate sample image is determined.
Step S17, aiming at each feature point, selecting a target three-dimensional map point from the three-dimensional map points corresponding to the candidate sample image based on the distance between the local descriptor to be detected corresponding to the feature point and the sample local descriptor corresponding to each three-dimensional map point corresponding to the candidate sample image; and the distance between the local descriptor to be detected and the sample local descriptor corresponding to the target three-dimensional map point is the minimum distance, and the minimum distance is smaller than the distance threshold.
For example, assuming that the candidate sample image corresponds to the three-dimensional map point 1, the three-dimensional map point 2, and the three-dimensional map point 3, the distance 1 between the local descriptor to be measured and the sample local descriptor corresponding to the three-dimensional map point 1 may be calculated, the distance 2 between the local descriptor to be measured and the sample local descriptor corresponding to the three-dimensional map point 2 may be calculated, and the distance 3 between the local descriptor to be measured and the sample local descriptor corresponding to the three-dimensional map point 3 may be calculated.
In one possible embodiment, if the distance 1 is the minimum distance, the three-dimensional map point 1 is selected as the target three-dimensional map point. Or, if the distance 1 is smaller than the distance threshold, the distance 2 is smaller than the distance threshold, but the distance 3 is not smaller than the distance threshold, both the three-dimensional map point 1 and the three-dimensional map point 2 may be selected as the target three-dimensional map point. Or, if the distance 1 is the minimum distance and the distance 1 is smaller than the distance threshold, the three-dimensional map point 1 may be selected as the target map point, but if the distance 1 is the minimum distance and the distance 1 is not smaller than the distance threshold, the target three-dimensional map point cannot be selected, that is, the global positioning fails.
And aiming at each characteristic point of the target image, selecting a target three-dimensional map point corresponding to the characteristic point from the candidate sample image corresponding to the target image to obtain the matching relation between the characteristic point and the target three-dimensional map point.
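A minimal sketch of matching one feature point of the target image to a target three-dimensional map point of the candidate sample image, using the minimum-distance-plus-threshold rule described above; the array shapes and names are illustrative.

```python
import numpy as np

def match_feature_to_map_point(query_local_desc, map_point_descs, dist_threshold):
    """Match one feature point to a target three-dimensional map point.

    query_local_desc : (D,) local descriptor to be tested of the feature point.
    map_point_descs  : (P, D) sample local descriptors of the candidate image's map points.
    Returns the index of the matched map point, or None if no match passes the threshold.
    """
    dists = np.linalg.norm(map_point_descs - query_local_desc, axis=1)
    best = int(np.argmin(dists))
    return best if dists[best] < dist_threshold else None
```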
And step S18, determining a global positioning pose in the three-dimensional visual map corresponding to the target image based on the plurality of feature points corresponding to the target image and the target three-dimensional map points corresponding to the plurality of feature points.
For example, the feature point 1 of the target image corresponds to the three-dimensional map point 1, the feature point 2 of the target image corresponds to the three-dimensional map point 2, and so on, thereby obtaining a plurality of matching relationship pairs, each matching relationship pair includes a two-dimensional feature point and a three-dimensional map point, the two-dimensional feature point represents a two-dimensional position in the target image, and the three-dimensional map point represents a three-dimensional position in the three-dimensional visual map, that is, the matching relationship pair includes a mapping relationship from the two-dimensional position to the three-dimensional position, that is, a mapping relationship from the two-dimensional position in the target image to the three-dimensional position in the three-dimensional visual map.
If the total number of matching relationship pairs does not meet the number requirement, the global positioning pose in the three-dimensional visual map corresponding to the target image cannot be determined based on these matching relationship pairs. If the total number of matching relationship pairs meets the number requirement (that is, the total number reaches a preset value), the global positioning pose in the three-dimensional visual map corresponding to the target image can be determined based on the matching relationship pairs.
For example, a PnP (Perspective-n-Point) algorithm may be used to calculate the global positioning pose of the target image in the three-dimensional visual map, and the calculation method is not limited. The input data of the PnP algorithm is the plurality of matching relationship pairs; each pair includes a two-dimensional position in the target image and a three-dimensional position in the three-dimensional visual map, and based on these pairs the pose of the target image in the three-dimensional visual map, that is, the global positioning pose, can be calculated by the PnP algorithm.
In a possible implementation manner, after obtaining the plurality of matching relationship pairs, valid matching relationship pairs may first be selected from them, and based on the valid matching relationship pairs the global positioning pose of the target image in the three-dimensional visual map can be calculated by the PnP algorithm. For example, a RANSAC (RANdom SAmple Consensus) algorithm may be adopted to select the valid matching relationship pairs from all matching relationship pairs, which is not limited in this process.
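A minimal sketch of this PnP-with-RANSAC step using OpenCV is shown below, assuming the intrinsic matrix K of the target image is known and lens distortion is ignored; the inlier-count check is only an illustrative stand-in for the number requirement mentioned above.

```python
import cv2
import numpy as np

def estimate_global_pose(points_2d, points_3d, K):
    """Estimate the global positioning pose from 2D-3D matching relationship pairs.

    points_2d : (N, 2) feature-point positions in the target image.
    points_3d : (N, 3) matched target three-dimensional map points.
    K         : 3x3 camera intrinsic matrix of the target image.
    Returns (R, t), or None when too few valid (inlier) matches remain.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(points_3d, dtype=np.float64),
        np.asarray(points_2d, dtype=np.float64),
        K, None)
    if not ok or inliers is None or len(inliers) < 4:
        return None                      # global positioning fails
    R, _ = cv2.Rodrigues(rvec)           # rotation matrix from rotation vector
    return R, tvec
```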
In conclusion, the global positioning pose of the terminal device in the three-dimensional visual map can be determined.
According to the technical scheme, a three-dimensional visual map of the target scene can be constructed, the terminal device in the target scene can be globally positioned based on the three-dimensional visual map, and the terminal device can be positioned accurately. The three-dimensional visual map can be constructed from panoramic images of the target scene; since a panoramic image has a larger field of view, repeated data acquisition of the target scene is avoided and the data acquisition efficiency is improved. When determining the three-dimensional map points, the pinhole images can be obtained by projection through virtual cameras, pose constraints between the virtual cameras are added, and the three-dimensional map points are determined with a virtual-camera bundled optimization strategy, which improves the robustness, efficiency and accuracy of map building. Implementing the SFM reconstruction process with a panoramic camera improves the accuracy and robustness of map building, and adding the multi-camera bundled optimization strategy to constrain the fixed-connection relation between the virtual cameras further improves the mapping accuracy. N virtual cameras fixedly connected to each other are obtained from the spherical projection relation, and the fixed-connection constraint is added to the reconstruction process; this constraint improves the registration accuracy of the multiple cameras and reduces the proportion of erroneous registrations. In addition, the idea of bundled optimization with fixed-connection constraints between virtual pinhole cameras can be extended to multi-camera systems with hardware fixed connections; the algorithm is general, and only the fixed-connection external parameters of the cameras need to be modified, which is not repeated here.
Based on the same application concept as the above method, an embodiment of the present application provides a map building apparatus. Fig. 5 shows a structural diagram of the map building apparatus, and the apparatus may include:
an obtaining module 51, configured to obtain a panoramic image of a target scene;
a generating module 52, configured to generate a first pinhole image corresponding to the first virtual camera based on the panoramic image; the position of the first virtual camera is the sphere center position of a visual spherical coordinate system, and the initial posture of the first virtual camera is any posture taking the sphere center position as the center;
a determining module 53, configured to determine a rotation matrix between a target pose of a second virtual camera and the initial pose, and determine a reference matrix between the target pose and the initial pose based on the rotation matrix; the position of the second virtual camera is the sphere center position of the visual spherical coordinate system, and the target posture is obtained by rotating the initial posture around the coordinate axis of the visual spherical coordinate system;
the generating module 52 is further configured to determine a second pinhole image corresponding to the second virtual camera based on the rotation matrix; the determining module 53 is further configured to select a two-dimensional feature point corresponding to an actual position of the target scene from the first pinhole image and the second pinhole image, and determine a three-dimensional map point corresponding to the actual position based on the two-dimensional feature point and the external reference matrix; and constructing a three-dimensional visual map of the target scene based on the plurality of three-dimensional map points of the target scene.
For example, the generating module 52 is specifically configured to, when generating the first pinhole image corresponding to the first virtual camera based on the panoramic image: generating a spherical view image corresponding to the spherical view coordinate system based on the panoramic image; and generating a first small hole image corresponding to the first virtual camera based on the view spherical surface image.
For example, the generating module 52 is specifically configured to, when generating the spherical view image corresponding to the spherical view coordinate system based on the panoramic image: determining a mapping relation between longitude and latitude coordinates in the spherical view image and rectangular coordinates in the panoramic image based on the width and the height of the panoramic image; aiming at each longitude and latitude coordinate in the spherical view image, determining a rectangular coordinate corresponding to the longitude and latitude coordinate from the panoramic image based on the mapping relation, and determining a pixel value of the longitude and latitude coordinate based on a pixel value of the rectangular coordinate; and generating the apparent spherical image based on the pixel value of each longitude and latitude coordinate in the apparent spherical image.
For example, the generating module 52 is specifically configured to, when generating the first pinhole image corresponding to the first virtual camera based on the spherical view image: determining a center point coordinate of a first small hole image based on the width and the height of the first small hole image, and determining a mapping relation between a rectangular coordinate in the first small hole image and a longitude and latitude coordinate in the spherical viewing surface image based on the center point coordinate and a target distance, wherein the target distance is a distance between the center point of the first small hole image and the spherical center position of the spherical viewing surface coordinate system; aiming at each rectangular coordinate in the first small hole image, determining a longitude and latitude coordinate corresponding to the rectangular coordinate from the spherical viewing surface image based on the mapping relation, and determining a pixel value of the rectangular coordinate based on a pixel value of the longitude and latitude coordinate; the first pinhole image is generated based on the pixel values of each rectangular coordinate in the first pinhole image.
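As a rough sketch of the pixel-to-view-sphere mapping the generating module relies on, the code below uses one common convention in which the target distance acts as a focal length and longitude/latitude are measured from the sphere center; the exact mapping defined by formulas (6), (7) and (13) of the description may differ, so this is only an illustrative assumption.

```python
import numpy as np

def pinhole_pixel_to_latlon(u, v, width, height, target_distance):
    """Map a rectangular coordinate (u, v) of the pinhole image to a
    latitude/longitude coordinate on the view sphere (one possible convention)."""
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0     # center point of the pinhole image
    ray = np.array([u - cx, v - cy, target_distance])  # ray from the sphere center through the pixel
    ray = ray / np.linalg.norm(ray)
    lon = np.arctan2(ray[0], ray[2])                   # longitude around the vertical axis
    lat = np.arcsin(ray[1])                            # latitude above/below the equator
    return lat, lon
```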
For example, the determining module 53 is specifically configured to determine a rotation matrix between the target pose and the initial pose of the second virtual camera: determining a first rotation angle in a first coordinate axis direction between the target posture and the initial posture, and determining a first sub-rotation matrix in the first coordinate axis direction based on the first rotation angle; determining a second rotation angle in a second coordinate axis direction between the target posture and the initial posture, and determining a second sub-rotation matrix in the second coordinate axis direction based on the second rotation angle; determining a third rotation angle in a third coordinate axis direction between the target posture and the initial posture, and determining a third sub-rotation matrix in the third coordinate axis direction based on the third rotation angle; determining a rotation matrix between the target pose and the initial pose based on the first, second, and third sub-rotation matrices.
For example, the determining module 53 is specifically configured to, when determining the external reference matrix between the target pose and the initial pose based on the rotation matrix: determining a translation matrix between the first virtual camera and the second virtual camera; determining the external parameter matrix based on the rotation matrix and the translation matrix.
For example, the determining module 53 is specifically configured to, when determining the three-dimensional map point corresponding to the actual position based on the two-dimensional feature point and the external reference matrix: determining a target loss value of the configured loss function; determining a projection function value between a coordinate system of a virtual camera and a coordinate system of a pinhole image based on the target loss value and the two-dimensional feature points corresponding to the actual positions; and determining the three-dimensional map point corresponding to the actual position based on the external reference matrix and the projection function value.
Illustratively, the three-dimensional visual map comprises a sample global descriptor corresponding to a sample image, three-dimensional map points corresponding to the sample image and sample local descriptors corresponding to the three-dimensional map points; the sample image is a pinhole image selected from the first pinhole image and the second pinhole image. The device further comprises a positioning module, configured to acquire a target image of the terminal device in the target scene during the global positioning process of the terminal device; select a candidate sample image from the multi-frame sample images based on the similarity between the target image and the multi-frame sample images corresponding to the three-dimensional visual map; acquire a plurality of feature points from the target image; for each feature point, determine a target three-dimensional map point corresponding to the feature point from the three-dimensional map points corresponding to the candidate sample image; and determine a global positioning pose in the three-dimensional visual map corresponding to the target image based on the plurality of feature points and the target three-dimensional map points corresponding to the plurality of feature points.
For example, when selecting candidate sample images from the multi-frame sample images based on the similarity between the target image and the multi-frame sample images corresponding to the three-dimensional visual map, the positioning module is specifically configured to: determine a global descriptor to be tested corresponding to the target image, and determine the distance between the global descriptor to be tested and the sample global descriptor corresponding to each frame of sample image corresponding to the three-dimensional visual map; and select a candidate sample image from the multi-frame sample images based on the distance between the global descriptor to be tested and each sample global descriptor; wherein the distance between the global descriptor to be tested and the sample global descriptor corresponding to the candidate sample image is the minimum distance, or the distance between the global descriptor to be tested and the sample global descriptor corresponding to the candidate sample image is smaller than a distance threshold.
For example, when the positioning module determines a target three-dimensional map point corresponding to the feature point from the three-dimensional map points corresponding to the candidate sample image, the positioning module is specifically configured to: determining a local descriptor to be tested corresponding to the feature point, wherein the local descriptor to be tested is used for representing a feature vector of an image block where the feature point is located, and the image block is located in the target image; determining the distance between the local descriptor to be tested and the sample local descriptor corresponding to each three-dimensional map point corresponding to the candidate sample image; selecting target three-dimensional map points from the three-dimensional map points corresponding to the candidate sample images based on the distance between the local descriptor to be detected and each sample local descriptor; and the distance between the local descriptor to be detected and the sample local descriptor corresponding to the target three-dimensional map point is a minimum distance, and the minimum distance is smaller than a distance threshold value.
Based on the same application concept as the method, the embodiment of the present application provides a map building apparatus, which may include: a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor; the processor is configured to execute machine executable instructions to implement the mapping method disclosed in the above example of the present application.
Based on the same application concept as the method, embodiments of the present application further provide a machine-readable storage medium, where a plurality of computer instructions are stored, and when the computer instructions are executed by a processor, the map construction method disclosed in the above example of the present application can be implemented.
The machine-readable storage medium may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions and data. For example, the machine-readable storage medium may be: a RAM (Random Access Memory), a volatile memory, a non-volatile memory, a flash memory, a storage drive (e.g., a hard drive), a solid state drive, any type of storage disk (e.g., an optical disc, a DVD, etc.), or a similar storage medium, or a combination thereof.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (13)

1. A map construction method, characterized in that the method comprises:
acquiring a panoramic image of a target scene, and generating a first pinhole image corresponding to a first virtual camera based on the panoramic image; the position of the first virtual camera is the sphere center position of a visual spherical coordinate system, and the initial posture of the first virtual camera is any posture taking the sphere center position as the center;
determining a rotation matrix between a target pose of a second virtual camera and the initial pose, and determining a reference matrix between the target pose and the initial pose based on the rotation matrix; the position of the second virtual camera is the sphere center position of the visual spherical coordinate system, and the target posture is obtained by rotating the initial posture around the coordinate axis of the visual spherical coordinate system;
determining a second pinhole image corresponding to a second virtual camera based on the rotation matrix, selecting two-dimensional feature points corresponding to the actual position of the target scene from the first pinhole image and the second pinhole image, and determining three-dimensional map points corresponding to the actual position based on the two-dimensional feature points and the external reference matrix;
a three-dimensional visual map of a target scene is constructed based on a plurality of three-dimensional map points of the target scene.
2. The method of claim 1,
the generating a first pinhole image corresponding to a first virtual camera based on the panoramic image comprises:
generating a spherical view image corresponding to the spherical view coordinate system based on the panoramic image;
and generating a first small hole image corresponding to the first virtual camera based on the view spherical surface image.
3. The method of claim 2,
the generating of the view spherical image corresponding to the view spherical coordinate system based on the panoramic image comprises:
determining a mapping relation between longitude and latitude coordinates in the spherical view image and rectangular coordinates in the panoramic image based on the width and the height of the panoramic image; aiming at each longitude and latitude coordinate in the spherical view image, determining a rectangular coordinate corresponding to the longitude and latitude coordinate from the panoramic image based on the mapping relation, and determining a pixel value of the longitude and latitude coordinate based on a pixel value of the rectangular coordinate;
and generating the apparent spherical image based on the pixel value of each longitude and latitude coordinate in the apparent spherical image.
4. The method of claim 2, wherein generating the first pinhole image corresponding to the first virtual camera based on the view sphere image comprises:
determining a center point coordinate of a first small hole image based on the width and the height of the first small hole image, and determining a mapping relation between a rectangular coordinate in the first small hole image and a longitude and latitude coordinate in a spherical viewing surface image based on the center point coordinate and a target distance, wherein the target distance is a distance between the center point of the first small hole image and the spherical center position of the spherical viewing surface coordinate system; for each rectangular coordinate in the first small hole image, determining a longitude and latitude coordinate corresponding to the rectangular coordinate from the spherical-view image based on the mapping relation, and determining a pixel value of the rectangular coordinate based on a pixel value of the longitude and latitude coordinate;
the first pinhole image is generated based on the pixel values of each rectangular coordinate in the first pinhole image.
5. The method of claim 1,
the determining a rotation matrix between a target pose of a second virtual camera and the initial pose comprises:
determining a first rotation angle in a first coordinate axis direction between the target posture and the initial posture, and determining a first sub-rotation matrix in the first coordinate axis direction based on the first rotation angle;
determining a second rotation angle in a second coordinate axis direction between the target posture and the initial posture, and determining a second sub-rotation matrix in the second coordinate axis direction based on the second rotation angle;
determining a third rotation angle in a third coordinate axis direction between the target posture and the initial posture, and determining a third sub-rotation matrix in the third coordinate axis direction based on the third rotation angle;
determining a rotation matrix between the target pose and the initial pose based on the first, second, and third sub-rotation matrices.
6. The method of claim 1, wherein the determining an external reference matrix between the target pose and the initial pose based on the rotation matrix comprises:
determining a translation matrix between the first virtual camera and the second virtual camera;
determining the external parameter matrix based on the rotation matrix and the translation matrix.
7. The method of claim 1, wherein determining the three-dimensional map point corresponding to the actual location based on the two-dimensional feature points and the external reference matrix comprises:
determining a target loss value of the configured loss function;
determining a projection function value between a coordinate system of the virtual camera and a coordinate system of the pinhole image based on the target loss value and the two-dimensional feature points corresponding to the actual positions;
and determining the three-dimensional map point corresponding to the actual position based on the external reference matrix and the projection function value.
8. The method of claim 1,
the three-dimensional visual map comprises a sample global descriptor corresponding to a sample image, a three-dimensional map point corresponding to the sample image and a sample local descriptor corresponding to the three-dimensional map point; wherein the sample image is a pinhole image selected from the first pinhole image and the second pinhole image;
after the building the three-dimensional visual map of the target scene, the method further comprises:
in the global positioning process of the terminal equipment, acquiring a target image of the terminal equipment in the target scene; selecting candidate sample images from the multi-frame sample images based on the similarity between the target image and the multi-frame sample images corresponding to the three-dimensional visual map;
acquiring a plurality of feature points from the target image; for each feature point, determining a target three-dimensional map point corresponding to the feature point from the three-dimensional map points corresponding to the candidate sample image;
and determining a global positioning pose in the three-dimensional visual map corresponding to the target image based on the plurality of feature points and the target three-dimensional map points corresponding to the plurality of feature points.
9. The method of claim 8,
the selecting a candidate sample image from the multi-frame sample images based on the similarity between the target image and the multi-frame sample images corresponding to the three-dimensional visual map includes:
determining a global descriptor to be detected corresponding to the target image, and determining the distance between the global descriptor to be detected and a sample global descriptor corresponding to each frame of sample image corresponding to the three-dimensional visual map;
selecting a candidate sample image from the multi-frame sample images based on the distance between the global descriptor to be detected and each sample global descriptor; the distance between the global descriptor to be detected and the sample global descriptor corresponding to the candidate sample image is the minimum distance; or the distance between the global descriptor to be detected and the sample global descriptor corresponding to the candidate sample image is smaller than a distance threshold value.
10. The method of claim 8, wherein determining the target three-dimensional map point corresponding to the feature point from the three-dimensional map points corresponding to the candidate sample image comprises:
determining a local descriptor to be tested corresponding to the feature point, wherein the local descriptor to be tested is used for representing a feature vector of an image block where the feature point is located, and the image block is located in the target image;
determining the distance between the local descriptor to be tested and the sample local descriptor corresponding to each three-dimensional map point corresponding to the candidate sample image; selecting target three-dimensional map points from the three-dimensional map points corresponding to the candidate sample images based on the distance between the local descriptor to be detected and each sample local descriptor;
and the distance between the local descriptor to be detected and the sample local descriptor corresponding to the target three-dimensional map point is a minimum distance, and the minimum distance is smaller than a distance threshold value.
11. A map building apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring a panoramic image of a target scene;
the generating module is used for generating a first pinhole image corresponding to the first virtual camera based on the panoramic image; the position of the first virtual camera is the sphere center position of a visual spherical coordinate system, and the initial posture of the first virtual camera is any posture taking the sphere center position as the center;
a determination module to determine a rotation matrix between a target pose of a second virtual camera and the initial pose and to determine an external parameter matrix between the target pose and the initial pose based on the rotation matrix; the position of the second virtual camera is the sphere center position of the visual spherical coordinate system, and the target posture is obtained by rotating the initial posture around the coordinate axis of the visual spherical coordinate system;
the generating module is further configured to determine a second pinhole image corresponding to the second virtual camera based on the rotation matrix; the determining module is further configured to select a two-dimensional feature point corresponding to an actual position of the target scene from the first pinhole image and the second pinhole image, and determine a three-dimensional map point corresponding to the actual position based on the two-dimensional feature point and the external parameter matrix; and constructing a three-dimensional visual map of the target scene based on the plurality of three-dimensional map points of the target scene.
12. The apparatus of claim 11,
the generating module is specifically configured to, when generating a first pinhole image corresponding to a first virtual camera based on the panoramic image: generating a spherical view image corresponding to the spherical view coordinate system based on the panoramic image; generating a first pinhole image corresponding to the first virtual camera based on the view spherical image;
the generation module is specifically configured to, when generating the spherical view image corresponding to the spherical view coordinate system based on the panoramic image: determining a mapping relation between longitude and latitude coordinates in the spherical view image and rectangular coordinates in the panoramic image based on the width and the height of the panoramic image; aiming at each longitude and latitude coordinate in the spherical view image, determining a rectangular coordinate corresponding to the longitude and latitude coordinate from the panoramic image based on the mapping relation, and determining a pixel value of the longitude and latitude coordinate based on a pixel value of the rectangular coordinate; generating a spherical view image based on the pixel value of each longitude and latitude coordinate in the spherical view image;
when generating the first pinhole image corresponding to the first virtual camera based on the spherical view image, the generating module is specifically configured to: determine a center point coordinate of the first pinhole image based on the width and the height of the first pinhole image, and determine a mapping relation between rectangular coordinates in the first pinhole image and longitude and latitude coordinates in the spherical view image based on the center point coordinate and a target distance, wherein the target distance is the distance between the center point of the first pinhole image and the sphere center position of the visual spherical coordinate system; for each rectangular coordinate in the first pinhole image, determine the longitude and latitude coordinate corresponding to the rectangular coordinate from the spherical view image based on the mapping relation, and determine a pixel value of the rectangular coordinate based on a pixel value of the longitude and latitude coordinate; and generate the first pinhole image based on the pixel values of the rectangular coordinates in the first pinhole image (an illustrative projection sketch follows this claim);
when determining the rotation matrix between the target pose of the second virtual camera and the initial pose, the determining module is specifically configured to: determine a first rotation angle in a first coordinate axis direction between the target pose and the initial pose, and determine a first sub-rotation matrix in the first coordinate axis direction based on the first rotation angle; determine a second rotation angle in a second coordinate axis direction between the target pose and the initial pose, and determine a second sub-rotation matrix in the second coordinate axis direction based on the second rotation angle; determine a third rotation angle in a third coordinate axis direction between the target pose and the initial pose, and determine a third sub-rotation matrix in the third coordinate axis direction based on the third rotation angle; and determine the rotation matrix between the target pose and the initial pose based on the first sub-rotation matrix, the second sub-rotation matrix, and the third sub-rotation matrix;
when determining the external parameter matrix between the target pose and the initial pose based on the rotation matrix, the determining module is specifically configured to: determine a translation matrix between the first virtual camera and the second virtual camera; and determine the external parameter matrix based on the rotation matrix and the translation matrix (an illustrative rotation and extrinsic-matrix sketch follows this claim);
when determining the three-dimensional map point corresponding to the actual position based on the two-dimensional feature point and the external parameter matrix, the determining module is specifically configured to: determine a target loss value of a configured loss function; determine a projection function value between the coordinate system of the virtual camera and the coordinate system of the pinhole image based on the target loss value and the two-dimensional feature point corresponding to the actual position; and determine the three-dimensional map point corresponding to the actual position based on the external parameter matrix and the projection function value;
wherein the three-dimensional visual map comprises a sample global descriptor corresponding to a sample image, three-dimensional map points corresponding to the sample image, and sample local descriptors corresponding to the three-dimensional map points; the sample image is a pinhole image selected from the first pinhole image and the second pinhole image; and the apparatus further comprises: a positioning module configured to, in a global positioning process of a terminal device, acquire a target image of the terminal device in the target scene; select a candidate sample image from multi-frame sample images based on the similarity between the target image and the multi-frame sample images corresponding to the three-dimensional visual map; acquire a plurality of feature points from the target image; for each feature point, determine a target three-dimensional map point corresponding to the feature point from the three-dimensional map points corresponding to the candidate sample image; and determine a global positioning pose of the target image in the three-dimensional visual map based on the plurality of feature points and the target three-dimensional map points corresponding to the plurality of feature points;
when selecting the candidate sample image from the multi-frame sample images based on the similarity between the target image and the multi-frame sample images corresponding to the three-dimensional visual map, the positioning module is specifically configured to: determine a global descriptor to be tested corresponding to the target image, and determine the distance between the global descriptor to be tested and the sample global descriptor corresponding to each frame of sample image corresponding to the three-dimensional visual map; and select the candidate sample image from the multi-frame sample images based on the distances between the global descriptor to be tested and the sample global descriptors; wherein the distance between the global descriptor to be tested and the sample global descriptor corresponding to the candidate sample image is the minimum distance, or the distance between the global descriptor to be tested and the sample global descriptor corresponding to the candidate sample image is smaller than a distance threshold;
when determining the target three-dimensional map point corresponding to the feature point from the three-dimensional map points corresponding to the candidate sample image, the positioning module is specifically configured to: determine a local descriptor to be tested corresponding to the feature point, wherein the local descriptor to be tested represents a feature vector of an image block in the target image where the feature point is located; determine the distance between the local descriptor to be tested and the sample local descriptor corresponding to each three-dimensional map point corresponding to the candidate sample image; and select the target three-dimensional map point from the three-dimensional map points corresponding to the candidate sample image based on the distances between the local descriptor to be tested and the sample local descriptors; wherein the distance between the local descriptor to be tested and the sample local descriptor corresponding to the target three-dimensional map point is the minimum distance, and the minimum distance is smaller than a distance threshold.
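The longitude/latitude mappings recited in claim 12 can be pictured with the following sketch, which samples a pinhole image directly from an equirectangular panoramic image (the spherical view image is implicit in the longitude/latitude parameterization). The equirectangular layout of the panorama, the image sizes, the use of the focal length `f` as the target distance and the nearest-neighbor sampling are all assumptions.

```python
import numpy as np

def pinhole_from_panorama(pano, width=640, height=480, f=320.0, R=np.eye(3)):
    """Sample a pinhole (perspective) image from an equirectangular panorama.
    pano: H x W x 3 panoramic image whose columns map to longitude and whose
          rows map to latitude (assumed equirectangular layout).
    f   : distance, in pixels, from the pinhole image center point to the
          sphere center position (the target distance of claim 12).
    R   : rotation matrix of the virtual camera's target pose relative to
          the initial pose (identity for the first virtual camera)."""
    ph, pw = pano.shape[:2]
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0   # center point coordinate
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    # A ray through each rectangular coordinate of the pinhole image,
    # rotated back into the initial pose of the visual spherical coordinate system
    rays = np.stack([u - cx, v - cy, np.full((height, width), f)], axis=-1)
    rays = rays @ R.T
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    lon = np.arctan2(rays[..., 0], rays[..., 2])       # longitude in [-pi, pi]
    lat = np.arcsin(np.clip(rays[..., 1], -1.0, 1.0))  # latitude in [-pi/2, pi/2]
    # Longitude/latitude coordinates -> rectangular coordinates in the panorama
    px = np.rint((lon / (2 * np.pi) + 0.5) * (pw - 1)).astype(int)
    py = np.rint((lat / np.pi + 0.5) * (ph - 1)).astype(int)
    return pano[py.clip(0, ph - 1), px.clip(0, pw - 1)]
```

Keeping the panoramic image fixed and changing `R` yields the second pinhole image of the claims from the same sphere center position.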
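Likewise, the per-axis sub-rotation matrices and the external parameter matrix of claim 12 can be sketched as follows. The x→y→z multiplication order and the representation of the translation matrix as a 3-vector are assumptions.

```python
import numpy as np

def rotation_from_euler(ax, ay, az):
    """Compose the rotation matrix between the target pose and the initial
    pose from the first, second and third rotation angles about the three
    coordinate axes of the visual spherical coordinate system."""
    rx = np.array([[1, 0, 0],
                   [0, np.cos(ax), -np.sin(ax)],
                   [0, np.sin(ax),  np.cos(ax)]])
    ry = np.array([[ np.cos(ay), 0, np.sin(ay)],
                   [0, 1, 0],
                   [-np.sin(ay), 0, np.cos(ay)]])
    rz = np.array([[np.cos(az), -np.sin(az), 0],
                   [np.sin(az),  np.cos(az), 0],
                   [0, 0, 1]])
    return rz @ ry @ rx  # assumed x -> y -> z composition order

def extrinsic_matrix(R, t):
    """Assemble the 3x4 external parameter matrix [R | t] from the rotation
    matrix and the translation between the first and second virtual cameras."""
    return np.hstack([R, np.asarray(t, dtype=float).reshape(3, 1)])
```

For example, `extrinsic_matrix(rotation_from_euler(0.0, np.pi / 2, 0.0), [0.0, 0.0, 0.0])` describes a second virtual camera rotated 90° about the second coordinate axis while remaining at the sphere center position.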
13. A map construction device, comprising: a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor; wherein the processor is configured to execute the machine-executable instructions to perform the method steps of any one of claims 1-10.
CN202111348552.8A 2021-11-15 2021-11-15 Map construction method, device and equipment Pending CN114187344A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111348552.8A CN114187344A (en) 2021-11-15 2021-11-15 Map construction method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111348552.8A CN114187344A (en) 2021-11-15 2021-11-15 Map construction method, device and equipment

Publications (1)

Publication Number Publication Date
CN114187344A CN114187344A (en) 2022-03-15

Family

ID=80540081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111348552.8A Pending CN114187344A (en) 2021-11-15 2021-11-15 Map construction method, device and equipment

Country Status (1)

Country Link
CN (1) CN114187344A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114674328A (en) * 2022-03-31 2022-06-28 北京百度网讯科技有限公司 Map generation method, map generation device, electronic device, storage medium, and vehicle

Similar Documents

Publication Publication Date Title
JP6768156B2 (en) Virtually enhanced visual simultaneous positioning and mapping systems and methods
CN111586360B (en) Unmanned aerial vehicle projection method, device, equipment and storage medium
US11222471B2 (en) Implementing three-dimensional augmented reality in smart glasses based on two-dimensional data
US20190012804A1 (en) Methods and apparatuses for panoramic image processing
US11748906B2 (en) Gaze point calculation method, apparatus and device
JP4685313B2 (en) Method for processing passive volumetric image of any aspect
WO2016199605A1 (en) Image processing device, method, and program
EP3274964B1 (en) Automatic connection of images using visual features
CN114120301A (en) Pose determination method, device and equipment
CN113361365A (en) Positioning method and device, equipment and storage medium
Sahin Comparison and calibration of mobile phone fisheye lens and regular fisheye lens via equidistant model
CN114185073A (en) Pose display method, device and system
Xian et al. Fusing stereo camera and low-cost inertial measurement unit for autonomous navigation in a tightly-coupled approach
CN114187344A (en) Map construction method, device and equipment
Irmisch et al. Simulation framework for a visual-inertial navigation system
CN116894876A (en) 6-DOF positioning method based on real-time image
CN117132649A (en) Ship video positioning method and device for artificial intelligent Beidou satellite navigation fusion
CN114882106A (en) Pose determination method and device, equipment and medium
WO2018150086A2 (en) Methods and apparatuses for determining positions of multi-directional image capture apparatuses
WO2018100230A1 (en) Method and apparatuses for determining positions of multi-directional image capture apparatuses
Amorós et al. Towards relative altitude estimation in topological navigation tasks using the global appearance of visual information
JP2005063012A (en) Full azimuth camera motion and method and device for restoring three-dimensional information and program and recording medium with the same recorded
JP3548652B2 (en) Apparatus and method for restoring object shape
CN112381873A (en) Data labeling method and device
Yang et al. A fast and effective panorama stitching algorithm on UAV aerial images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination