CN117036663A - Visual positioning method, device and storage medium - Google Patents

Visual positioning method, device and storage medium

Info

Publication number
CN117036663A
Authority
CN
China
Prior art keywords
image
pose
camera
candidate frame
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310995098.8A
Other languages
Chinese (zh)
Inventor
刘小伟
陈兵
周俊伟
王国毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Honor Device Co Ltd filed Critical Honor Device Co Ltd
Priority to CN202310995098.8A
Publication of CN117036663A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/22 - Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01C - MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 - Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/11 - Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/80 - Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 - Proximity, similarity or dissimilarity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/30 - Subject of image; Context of image processing
    • G06T2207/30244 - Camera pose
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Algebra (AREA)
  • Health & Medical Sciences (AREA)
  • Automation & Control Theory (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Operations Research (AREA)
  • Studio Devices (AREA)

Abstract

Embodiments of the present application provide a visual positioning method, a device, and a storage medium. The method includes: obtaining a front image captured by a front camera of a first device and a rear image captured by a rear camera of the first device, and finding a front candidate frame similar to the front image and a rear candidate frame similar to the rear image; determining point pairs of the front image according to the front candidate frame and point pairs of the rear image according to the rear candidate frame, where a point pair of the front image includes a pixel of the front image and the spatial point projected onto that pixel, and a point pair of the rear image includes a pixel of the rear image and the spatial point projected onto that pixel; and determining the pose of the first device according to the point pairs of the front image, the point pairs of the rear image, and the relative pose between the front camera and the rear camera of the first device. Because this scheme determines the pose of the first device using both the front image captured by the front camera and the relative pose between the front camera and the rear camera, the accuracy of visual positioning is improved.

Description

Visual positioning method, device and storage medium
This application is a divisional application of the Chinese patent application filed on April 18, 2022, with application number 202210415435.7 and entitled "Visual positioning method, device and storage medium".
Technical Field
The present application relates to the technical field of augmented reality, and in particular to a visual positioning method, a visual positioning device, and a storage medium.
Background
With the rapid development of software and hardware technologies, augmented reality (Augmented Reality, AR) technology implemented on electronic devices (e.g., smartphones) is increasingly applied in industries such as education and training, exhibitions, and the like.
AR technology displays a virtual image at a specific position on a display screen based on the pose of the device itself. Through AR technology, an electronic device can combine a virtual image with the image of a real object on the device screen, achieving an effect that blends the virtual and the real. The accuracy of the device pose has an important effect on the AR experience: the more accurate the pose of the device, the more naturally the virtual image combines with the image of the real object and the better the visual effect for the user; conversely, the less accurate the pose of the device, the poorer the visual effect.
At present, the pose of a device is generally determined by a visual positioning method, that is, the pose of the device is determined according to the current image captured by a camera of the device. Existing visual positioning methods use only images captured by a single camera on the device (such as the rear camera of a mobile phone), and their accuracy is low.
Disclosure of Invention
In view of the above problems, the present application provides a visual positioning method, a device, and a storage medium, which calculate the pose of an electronic device by combining images captured by cameras facing different directions on the electronic device, thereby improving the accuracy of the calculation result.
In order to achieve the above object, the present application provides the following technical solutions:
A first aspect of the present application provides a visual positioning method, including:
acquiring a front image and a rear image, wherein the front image is an image captured by a front camera of a first device, and the rear image is an image captured by a rear camera of the first device;
retrieving, in an offline map, a front candidate frame similar to the front image and a rear candidate frame similar to the rear image, wherein the offline map includes multiple frames of images captured by a second device;
determining point pairs of the front image according to the front candidate frame, and determining point pairs of the rear image according to the rear candidate frame, wherein a point pair of the front image includes a pixel of the front image and the spatial point projected onto that pixel, and a point pair of the rear image includes a pixel of the rear image and the spatial point projected onto that pixel;
and determining the pose of the first device according to the point pairs of the front image, the point pairs of the rear image, and the relative pose of the first device, wherein the relative pose of the first device is the relative pose between the front camera of the first device and the rear camera of the first device.
The first device may be an electronic device used by a user, i.e., a user device, for example a smartphone; the second device is an image-capturing device dedicated to capturing the images required for constructing the offline map, i.e., an acquisition device.
A beneficial effect of this method is that the pose of the first device is determined using both the front image captured by the front camera and the relative pose between the front camera and the rear camera, so the accuracy of visual positioning is improved.
In some optional embodiments, the searching in the offline map to obtain the pre-candidate frame similar to the pre-image and the post-candidate frame similar to the post-image includes:
determining, among the images captured by the front camera of the second device, an image whose similarity to the front image is greater than a preset similarity threshold as a front candidate frame;
and determining, among the images captured by the rear camera of the second device, an image whose similarity to the rear image is greater than a preset similarity threshold as a rear candidate frame.
It will be appreciated that there may be a plurality of front candidate frames and rear candidate frames. For example, multiple frames whose similarity to the front image is greater than 0.8 may be selected as front candidate frames, and multiple frames whose similarity to the rear image is greater than 0.8 may be selected as rear candidate frames.
In some optional embodiments, before the determining point pairs of the front image according to the front candidate frame and determining point pairs of the rear image according to the rear candidate frame, the method further includes:
identifying image combinations among the front candidate frames and the rear candidate frames, wherein an image combination includes one front candidate frame and one rear candidate frame captured by the second device at the same time;
calculating, for each image combination, the relative pose error of the image combination, wherein the relative pose error of an image combination is the error between the relative pose of the front candidate frame and the rear candidate frame of the image combination and the relative pose of the second device, and the relative pose of the second device is the pre-calibrated relative pose between the front camera and the rear camera of the second device;
and reordering the image combinations in ascending order of their relative pose errors.
In some optional embodiments, after the identifying of image combinations among the front candidate frames and the rear candidate frames, the method further includes:
deleting the front candidate frames and the rear candidate frames that do not form any image combination.
In some optional embodiments, the determining point pairs of the front image according to the front candidate frame and determining point pairs of the rear image according to the rear candidate frame includes:
determining the point pairs of the front image according to the first N reordered front candidate frames, and determining the point pairs of the rear image according to the first N reordered rear candidate frames, wherein N is a preset positive integer.
A benefit of this embodiment is that, because the image combinations are reordered in ascending order of relative pose error, images with smaller relative pose errors are preferentially selected, which improves the accuracy of the visual positioning result.
In some optional embodiments, the determining point pairs of the front image according to the front candidate frame includes:
extracting image features of the front candidate frame and image features of the front image;
determining the pixels of the front candidate frame that match pixels of the front image by comparing the image features of the front candidate frame with the image features of the front image;
and determining, as a point pair of the front image, the spatial point projected onto a matched pixel of the front candidate frame together with the pixel of the front image that matches that pixel of the front candidate frame.
For example, the offline map records that pixel B1 of the front candidate frame is a projection of spatial point B. If, by comparing the image features of the front candidate frame with the image features of the front image, pixel B2 of the front image is determined to match pixel B1 of the front candidate frame, it can further be determined that pixel B2 and spatial point B form a point pair of the front image, that is, pixel B2 of the front image is a projection of spatial point B.
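As a non-limiting illustration, the point-pair construction described above can be sketched as follows in Python. The ORB features, the brute-force matcher, and all function and variable names are assumptions of this sketch only; the present application does not prescribe a particular feature extraction or matching algorithm.

```python
# Sketch: forming 2D-3D point pairs of the front image by matching its
# features against a front candidate frame whose pixel-to-spatial-point
# projections are stored in the offline map. ORB/BFMatcher are
# illustrative choices only.
import cv2

def point_pairs_from_candidate(front_img, candidate_img, candidate_pixel_to_3d):
    """candidate_pixel_to_3d: dict mapping (u, v) pixels of the candidate
    frame to the (X, Y, Z) spatial points recorded in the offline map."""
    orb = cv2.ORB_create()
    kp1, des1 = orb.detectAndCompute(front_img, None)
    kp2, des2 = orb.detectAndCompute(candidate_img, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)

    pairs = []  # list of (front-image pixel, spatial point)
    for m in matches:
        cand_px = tuple(int(round(c)) for c in kp2[m.trainIdx].pt)
        if cand_px in candidate_pixel_to_3d:        # candidate pixel with a
            front_px = kp1[m.queryIdx].pt           # recorded spatial point
            pairs.append((front_px, candidate_pixel_to_3d[cand_px]))
    return pairs
```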
In some optional embodiments, the determining the pose of the first device according to the point pairs of the front image, the point pairs of the rear image, and the relative pose of the first device includes:
calculating a pose estimate of the front camera of the first device according to the point pairs of the front image;
calculating a pose estimate of the rear camera of the first device according to the point pairs of the rear image;
optimizing the pose estimate of the front camera of the first device and the pose estimate of the rear camera of the first device based on a nonlinear optimization method to obtain the pose of the first device, wherein the optimization function of the nonlinear optimization method includes at least the relative pose error of the first device;
the relative pose error of the first device is the error between the relative pose estimate of the first device, which is calculated from the pose estimate of the front camera of the first device and the pose estimate of the rear camera of the first device, and the pre-calibrated relative pose of the first device, the latter being the relative pose between the front camera and the rear camera of the first device.
In some optional embodiments, the optimization function of the nonlinear optimization method includes the relative pose error of the first device, the re-projection error of the front image, and the re-projection error of the rear image.
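As a non-limiting illustration, the joint optimization can be sketched as a least-squares problem whose residuals stack the re-projection errors of the front image and the rear image with the relative pose error of the first device. The rotation-vector parameterization, the world-to-camera pose convention, the equal weighting of the residual terms, and the use of OpenCV/SciPy are assumptions of this sketch, not requirements of the present application.

```python
# Sketch: refining the front- and rear-camera pose estimates of the first
# device with a nonlinear least-squares cost combining two re-projection
# errors and the relative pose error against the pre-calibrated extrinsics.
import cv2
import numpy as np
from scipy.optimize import least_squares

def residuals(x, pts_front, pts_rear, K_front, K_rear, R_calib, t_calib):
    rvec_f, t_f, rvec_r, t_r = x[0:3], x[3:6], x[6:9], x[9:12]
    res = []
    for px, Xw in pts_front:                 # re-projection error, front image
        proj, _ = cv2.projectPoints(np.asarray([Xw], float), rvec_f, t_f, K_front, None)
        res.extend(proj.ravel() - np.asarray(px, float))
    for px, Xw in pts_rear:                  # re-projection error, rear image
        proj, _ = cv2.projectPoints(np.asarray([Xw], float), rvec_r, t_r, K_rear, None)
        res.extend(proj.ravel() - np.asarray(px, float))
    # relative pose estimated from the two camera poses vs. the calibration
    R_f, R_r = cv2.Rodrigues(rvec_f)[0], cv2.Rodrigues(rvec_r)[0]
    R_rel = R_r @ R_f.T
    t_rel = t_r - R_rel @ t_f
    res.extend(cv2.Rodrigues(R_rel @ R_calib.T)[0].ravel())  # rotation error
    res.extend(t_rel - t_calib)                              # translation error
    return np.asarray(res)

# x0 stacks the two initial pose estimates (e.g. from PnP); the refined
# poses are read back from result.x:
# result = least_squares(residuals, x0,
#                        args=(pts_front, pts_rear, Kf, Kr, R_calib, t_calib))
```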
A second aspect of the application provides an electronic device comprising a memory, one or more processors;
the memory is used for storing a computer program;
the one or more processors are configured to execute a computer program, in particular to implement the visual positioning method provided by any one of the first aspects of the present application.
A third aspect of the present application provides a computer storage medium storing a computer program which, when executed, is particularly adapted to carry out the visual positioning method provided in any one of the first aspects of the present application.
Embodiments of the present application provide a visual positioning method, a device, and a storage medium. The method includes: obtaining a front image captured by a front camera of a first device and a rear image captured by a rear camera of the first device, and finding a front candidate frame similar to the front image and a rear candidate frame similar to the rear image; determining point pairs of the front image according to the front candidate frame and point pairs of the rear image according to the rear candidate frame, where a point pair of the front image includes a pixel of the front image and the spatial point projected onto that pixel, and a point pair of the rear image includes a pixel of the rear image and the spatial point projected onto that pixel; and determining the pose of the first device according to the point pairs of the front image, the point pairs of the rear image, and the relative pose between the front camera and the rear camera of the first device. Because this scheme determines the pose of the first device using both the front image captured by the front camera and the relative pose between the front camera and the rear camera, the accuracy of visual positioning is improved.
Drawings
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 2 is a schematic view of an application scenario of a visual positioning service according to an embodiment of the present application;
FIG. 3 is a flowchart of a visual positioning method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of reordering pre-candidate frames and post-candidate frames according to an embodiment of the present application;
FIG. 5 is a schematic diagram of pairing pixel points and space points according to an embodiment of the present application;
FIG. 6 is a flowchart of a method for calculating a pose of a user device according to an embodiment of the present application;
fig. 7 is an interaction schematic diagram of a user device and a cloud end according to an embodiment of the present application;
fig. 8 is a flowchart of another visual positioning method according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below clearly and completely with reference to the accompanying drawings of the embodiments of the present application. The terminology used in the following embodiments is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in the specification of the application and the appended claims, the singular forms "a," "an," and "the" are intended to include expressions such as "one or more," unless the context clearly indicates otherwise. It should also be understood that in the embodiments of the present application, "one or more" means one, two, or more than two; "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may represent: A alone, both A and B, and B alone, where A and B may be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship.
For convenience of description, terms that may be related to the present application will be briefly described first.
The pose of a device includes the pose of one or more cameras on the electronic device, that is, the position and attitude of one or more cameras on the electronic device. Taking a smartphone with a front camera and a rear camera as an example, the pose of the smartphone includes the front camera pose, i.e., the position and attitude of the phone's front camera, and the rear camera pose, i.e., the position and attitude of the phone's rear camera. The pose of a camera can be represented by a translation matrix T and a rotation matrix R.
The position of the camera can be written as a translation matrix T:

$$T = \begin{bmatrix} X & Y & Z \end{bmatrix}^{\top}$$

where X, Y and Z denote the coordinates of the camera position in the spatial rectangular coordinate system. The attitude of the camera can be written as a rotation matrix R, where R is a 3×3 matrix used to represent the rotation angles of the camera's orientation (which can be understood as the direction of the camera's optical axis) relative to the X, Y and Z axes of the spatial rectangular coordinate system. In some application scenarios, the rotation matrix and the translation matrix may also be combined into an extrinsic matrix P of the form:

$$P = \begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix}$$
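As a non-limiting illustration, the translation matrix T, the rotation matrix R and the combined extrinsic matrix P may be represented as in the following minimal Python/NumPy sketch; the 4×4 homogeneous layout follows the form of P given above.

```python
# Sketch: a camera pose as translation matrix T, rotation matrix R, and the
# combined 4x4 extrinsic matrix P used later when composing relative poses.
import numpy as np

def extrinsic_matrix(R, T):
    """R: 3x3 rotation matrix; T: camera position (X, Y, Z) -> 4x4 matrix P."""
    P = np.eye(4)
    P[:3, :3] = R
    P[:3, 3] = np.asarray(T, dtype=float).ravel()
    return P

R = np.eye(3)                    # camera aligned with the coordinate axes
T = np.array([1.0, 2.0, 0.5])    # camera position (X, Y, Z)
P = extrinsic_matrix(R, T)
```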
the pose of the image. For a frame of image captured by an electronic device, the pose of the electronic device when the frame of image is captured may be referred to as the pose of the frame of image.
Camera intrinsic parameters (camera intrinsics) describe internal parameters of the camera, such as the camera focal length and distortion parameters. Given the coordinates of a spatial point in the camera coordinate system, the plane coordinates of the pixel onto which that point is projected in the photo can be calculated from the camera intrinsics. The camera coordinate system is a spatial rectangular coordinate system whose origin is the optical center of the camera and whose Z axis is the optical axis of the camera. Camera intrinsics are generally calibrated in advance and stored in the storage medium of an electronic device with a camera (such as a mobile phone with a photographing function or a digital camera) before it leaves the factory. For example, for a smartphone with a front camera and a rear camera, the phone's memory stores the pre-calibrated camera intrinsics of the front camera and the camera intrinsics of the rear camera.
An offline map. To achieve visual positioning of an electronic device in a particular scene, an offline map of the scene needs to be constructed in advance. An offline map of a scene may include the following information: the multiple frames of images obtained by continuous shooting in the scene, the pose of each frame of image, the projection relationships of pixels in the images, and the parameters of the acquisition device.
The acquisition equipment is electronic equipment for shooting multi-frame images in the offline map. Parameters of the acquisition device may include camera intrinsic parameters of one or more cameras on the acquisition device, relative pose between the one or more cameras on the acquisition device. Generally, the relative pose of the acquisition device is pre-calibrated by the manufacturer and recorded in the acquisition device before the acquisition device leaves the factory.
The relative pose between the two cameras can be understood as the position and pose of one camera in the camera coordinate system of the other camera, and the relative pose between the two cameras can be represented by a rotation matrix and a translation matrix.
Generally, the method for constructing the offline map may be that an acquisition device is used to capture a multi-frame image of a certain scene, then the acquisition device uploads the captured multi-frame image and its own parameters to a cloud end, a server of the cloud end processes the image uploaded by the acquisition device through algorithms such as three-dimensional reconstruction, and the like to obtain the pose of each frame of image and the projection relationship of pixel points in the image, and finally the data are stored in the cloud end as the offline map of the corresponding scene.
Taking a market A as an example, the acquisition equipment shoots multi-frame images in the market A, then the multi-frame images of the market A and parameters of the acquisition equipment are uploaded to a cloud end, a server of the cloud end processes the images uploaded by the acquisition equipment through algorithms such as three-dimensional reconstruction, so that pose of each frame of images and projection relations of pixel points in the images are obtained, and finally the images of the market A, the pose of the images, the projection relations of the pixel points in the images and the parameters of the acquisition equipment are stored in the cloud end as an offline map of the market A.
Spatial points (also called 3D points) can be understood as points on the surface of a physical object. By contrast, a pixel on an image may be referred to as a 2D point.
Pixel projection relationship. The image of an object in a photo can be understood as the projection of the object actually existing in space onto the imaging plane of the camera; correspondingly, each pixel that makes up the object's image in the photo is the projection of some spatial point on the object's surface. For example, the image of a car in a photograph can be considered the projection of the car body onto the imaging plane, so the individual pixels that make up the image of the car are obviously produced by the projection of points on the car's surface onto the imaging plane. The pixel projection relationships in an offline map specify, for pixels in the photos of the offline map, which spatial point on an object surface in the scene each pixel is projected from, and the coordinates (positions) of those spatial points.
For example, for pixel A1 of an image, the projection relationship of the pixel may be that pixel A1 is obtained by the projection of the spatial point A1-3D with spatial coordinates (Xa, Ya, Za). In other words, pixel A1 on the image is the projection of a point located at coordinates (Xa, Ya, Za) on the surface of an object in space.
In order to reduce the data amount of the off-line map as a whole, the above projection relationship is generally not recorded for each pixel, but only for a part of representative pixels. For example, for a 658 x 658 image, the above projection relationship may only record the projection relationship of 30 pixels, i.e., record what spatial points are projected from the object surface for 30 pixels, and what the coordinates of these spatial points are.
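As a non-limiting illustration, a per-frame record of an offline map that stores such a sparse projection relationship might look as follows; all field names and values are hypothetical.

```python
# Sketch: a minimal per-frame record of an offline map. Only a sparse set of
# representative pixels carries a projection relationship, as described above.
frame_record = {
    "image_id": 17,
    "camera": "front",                 # which camera of the acquisition device
    "timestamp": 1650270000.40,        # shooting time, used later in S501
    "pose": {"R": [[1, 0, 0], [0, 1, 0], [0, 0, 1]],   # rotation matrix
             "T": [1.0, 2.0, 0.5]},                    # position in world frame
    # pixel (u, v) -> spatial point (X, Y, Z); e.g. ~30 entries per image
    "projections": {
        (120, 340): (3.12, -0.45, 7.80),
        (415, 88):  (2.07,  1.33, 6.10),
    },
}
```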
The world coordinate system is a spatial rectangular coordinate system established by taking any fixed point in space as the origin. The world coordinate systems used by different offline maps are generally different. The origin of the world coordinate system of an offline map may be designated as the position of a camera optical center of the acquisition device when the first frame of image in the offline map is captured, for example, the position of the optical center of the front camera of the mobile phone when the first frame of image is captured. The Z-axis of the world coordinate system may be designated as an axis perpendicular to the earth's surface, with the upward direction perpendicular to the earth's surface as the positive direction of the Z-axis.
For example, after consecutive multi-frame images of mall A are captured using the acquisition device, for example frames 1 to N of mall A, an offline map of mall A may be created from these images. The offline map of mall A includes frames 1 to N of mall A, the pose of each of frames 1 to N, the camera intrinsics of each camera on the acquisition device, and, for a number of pixels in frames 1 to N, the positions in mall A of the spatial points from whose projection those pixels are obtained.
In the embodiment of the present application, unless otherwise specified, the pose of the camera and the coordinates of the spatial points on the surface of the object are all referenced by the world coordinate system. In other words, the pose of the camera refers to the position and pose of the camera in the world coordinate system; the coordinates of the spatial points refer to the coordinates of the spatial points in the world coordinate system.
An embodiment of the present application provides an electronic device 100, which may specifically be a mobile phone, a tablet computer, or another device.
As shown in fig. 1, the electronic device 100 may include: processor 110, external memory 120, internal memory (also referred to as "memory") 121, universal serial bus (universal serial bus, USB) interface 130, charge management module 140, power management module 141, battery 142, antenna 1, antenna 2, mobile communication module 150, wireless communication module 160, audio module 170, speaker 170A, receiver 170B, microphone 170C, headset interface 170D, sensor module 180, keys 190, motor 191, indicator 192, camera 193, display 194, and subscriber identity module (subscriber identification module, SIM) card interface 195, etc. The sensor module 180 may include a pressure sensor, a gyroscope sensor, a barometric sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, and the like.
The processor 110 may include one or more processing units, such as: the processor 110 may include an application processor (application processor, AP), a communication processor (communication processor, CP, which may also be referred to as a modem), a graphics processor (graphics processing unit, GPU), and the like.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or recycled. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby improving the efficiency of the system.
The camera 193 may include one or more cameras, for example, the camera 193 may include one or more rear cameras mounted on the back of the electronic device and one or more front cameras mounted on the front of the electronic device (the front side on which the screen is located).
The display 194 may comprise one or more screens. The electronic device displays video, images, and a series of graphical user interfaces through a screen. In some embodiments, the display screen 194 may be combined with a touch screen, and a user may interact with the electronic device by clicking or sliding (with a finger or stylus) on the touch screen.
In the embodiment of the present application, the electronic device 100 may have an AR function by installing a specific application, and after the AR function is started, the electronic device 100 captures an image of a current scene through the camera 193 and displays the image of the current scene on the display screen 194, and at the same time, the electronic device 100 displays a preset virtual image (for example, an image of a virtual character) on the screen according to its pose. Thus, the electronic device 100 may present interactions of the virtual image with the physical image in the current scene on the screen.
In an embodiment of the present application, external memory 120 may store computer instructions.
The processor 110 may execute computer instructions to cause the electronic device 100 to implement the method of visual localization provided by any of the embodiments of the present application.
The method for visual positioning of the present application needs to use a front image and a rear image of the electronic device 100, where the front image is a photograph taken by a front camera, and the rear image is a photograph taken by a rear camera.
The above is a specific description of the embodiment of the present application using the electronic device 100 as an example. It should be understood that the illustrated structure of the embodiment of the present application does not constitute a specific limitation on the electronic device 100. The electronic device 100 may have more or fewer components than shown in the figures, may combine two or more components, or may have a different configuration of components. The various components shown in the figures may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application specific integrated circuits.
The electronic device provided by the embodiment of the application may be a User Equipment (UE), for example, a mobile terminal (for example, a user mobile phone), a tablet computer, a desktop, a laptop, a handheld computer, a netbook, a personal digital assistant (personal digital assistant, PDA) and other devices.
In addition, an operating system runs on these components, for example the iOS operating system developed by Apple Inc., the Android open-source operating system developed by Google Inc., or the Windows operating system developed by Microsoft Corporation, among others. Application programs may be installed on the operating system.
Illustratively, the electronic device 100 may be installed with a visual location service (Visual Positioning Service, VPS) component, and the electronic device 100 may perform the method of visual location provided by any one of the embodiments of the present application by running the VPS component.
In the embodiment of the application, according to different functions, two types of equipment may be involved, one type is called acquisition equipment, and refers to image pickup equipment for shooting multi-frame images in an offline map, and the other type is called user equipment, and refers to equipment for determining the pose by the visual positioning method. Both the acquisition device and the user device may have the structure of the electronic device as in fig. 1.
In general, the acquisition device and the user device are not the same device, and the following description will refer to the acquisition device and the user device not being the same device. Of course, in some alternative embodiments, the acquisition device and the user device may be the same device, and the visual positioning method of the present application is not so affected, nor is the application limited thereto.
In combination with the foregoing example, a manufacturer providing the related service may take a series of photos of mall A in advance with professional camera equipment, thereby constructing an offline map of mall A. A user then uses the AR function of a smartphone in mall A, and while the AR function is turned on, the user's smartphone determines its pose (i.e., the poses of its front and rear cameras) through the visual positioning method of the present application. In the above scenario, the camera equipment used by the manufacturer is the acquisition device, and the user's smartphone is the user device.
Fig. 2 is a schematic diagram of an AR function of a mobile phone, which is a typical application scenario of the VPS.
In response to a user operation, the user equipment 200 turns on its AR function, for example by opening a pre-installed AR application.
After the AR function is turned on, the user device 200 captures an image of the current scene in real time through the rear camera, and displays the image of the current scene on the screen. As shown in fig. 2, the image captured by the user equipment 200 includes a physical image 201.
On this basis, the user equipment 200 further displays a preset virtual image at a specific position in the image of the current scene; for example, the user equipment 200 may display a virtual character, i.e., the virtual image 202 shown in fig. 2, on top of the physical image 201. In some alternative embodiments, the virtual image on the screen may move or perform specific actions according to the user's operations or the movement of the user device, thereby enabling interaction between the user and the virtual image in a real scene.
It will be appreciated that in the scenario shown in fig. 2, in order to accurately display virtual image 202 over physical image 201, user device 200 needs to obtain the position of the real-time captured physical image on the screen. The position of the physical image on the screen may be determined according to the pose of the rear camera of the user equipment 200 when the image is captured, and the position of the actual object, such as the cup of the physical image 201 in fig. 2, in the scene.
The position of the actual object in the scene can be determined by the projection relationship of the pixel points in the offline map, and the pose of the rear camera of the user equipment 200 needs to be obtained by the visual positioning method of the application.
Referring to fig. 3, a flowchart of a visual positioning method provided by the present application is shown.
S301a, capturing a front image with a front camera.
The front camera in step S301a refers to the front camera of the user equipment.
S301b, shooting a rear-mounted image by using a rear-mounted camera.
The rear camera in step S301b refers to the rear camera of the user equipment.
It should be noted that step S301a and step S301b are performed simultaneously by the user equipment.
In combination with the foregoing example, the user device may be a smartphone with the AR function carried by the user. After the user enters a mall and turns on the AR function of the phone, the phone performs steps S301a and S301b to determine its pose, that is, it calls its front camera and rear camera to take pictures at the same time, obtaining the corresponding front image and rear image.
In some alternative embodiments, the user device may display the rear image on the screen, while the front image is not typically displayed on the screen.
In a scenario such as the AR function shown in fig. 2, a photograph taken by a rear camera of the handset may be displayed on the screen so as to be combined with a virtual image on the screen.
S302a, searching to obtain a pre-candidate frame.
The pre-candidate frames refer to several frames, among the front images of the offline map, that are similar to the front image captured by the user equipment in step S301a. In other words, step S302a amounts to retrieving, among the front images of the offline map, several frames similar to the front image of the user equipment.
The front images of the offline map refer to the images captured by the front camera of the acquisition device. That is, in this embodiment, because the front image and the rear image captured by the user device are combined when determining the pose of the user device, the acquisition device used to capture the images for constructing the offline map is correspondingly also provided with a front camera and a rear camera, and during acquisition it captures images with the front camera and the rear camera respectively; the multi-frame images captured by the front camera of the acquisition device are recorded as the front images of the offline map, and the multi-frame images captured by the rear camera of the acquisition device are recorded as the rear images of the offline map.
In some alternative embodiments, step S302a may be implemented by calculating, frame by frame, the similarity between each front image of the offline map and the front image of the user equipment, and then determining the front images of the offline map whose similarity is greater than a preset similarity threshold as pre-candidate frames. For example, if the similarity between a frame of the offline map's front images and the front image of the user equipment is greater than 70%, that frame is determined to be a pre-candidate frame.
In some optional embodiments, step S302a may also be implemented by calculating, frame by frame, the similarity between each front image of the offline map and the front image of the user equipment, and selecting the N frames with the highest similarity, in descending order of similarity, as N pre-candidate frames. N is a preset positive integer; for example, N may be set to 20, i.e., the 20 frames with the highest similarity among all front images of the offline map are selected as 20 pre-candidate frames.
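As a non-limiting illustration, the two retrieval strategies of step S302a (similarity threshold, or the N most similar frames) can be sketched as follows; the similarity function is pluggable (for example the histogram correlation shown further below), and the names are illustrative only.

```python
# Sketch of S302a: selecting pre-candidate frames either by a similarity
# threshold or by keeping the N offline-map front images most similar to
# the front image of the user equipment.
def retrieve_candidates(query_img, map_front_images, similarity,
                        threshold=None, top_n=None):
    scored = [(similarity(img, query_img), img) for img in map_front_images]
    if threshold is not None:                        # strategy 1: threshold
        return [img for s, img in scored if s > threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [img for _, img in scored[:top_n]]        # strategy 2: top-N
```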
The above are merely two exemplary implementations of step S302a; in practical applications, step S302a may also be implemented in other manners, which is not limited in this embodiment.
In step S302a, the similarity between the front image of the user equipment and a front image of the offline map may be calculated by any image similarity algorithm in related technical fields such as image processing, for example a structural similarity (SSIM) algorithm, a perceptual hash algorithm, a histogram method, etc.; this embodiment does not limit the specific image similarity algorithm used.
As an example, the similarity between each frame of the offline map's front images and the front image of the user equipment may be calculated in step S302a using the histogram method. The process of calculating image similarity with the histogram method is described below, taking one frame of the offline map's front images as an example:
Step a, unifying the sizes of the front image of the offline map and the front image of the user equipment. For example, if the front image of the offline map is larger than the front image of the user device, the front image of the offline map may be scaled down to the same size as the front image of the user device. It will be appreciated that if the front image of the offline map and the front image of the user device are already the same size, step a may be skipped and step b performed directly.
And b, converting the front image of the user equipment and the front image of the offline map into gray-scale images. If the front image of the user equipment and the front image of the offline map are originally gray-scale images, the step c can be directly executed by skipping the step b.
And c, determining a gray level histogram of the front image of the offline map and a gray level histogram of the front image of the user equipment.
Gray level histogram, which reflects the number of pixels for each gray level value in the image.
For example, assuming that the gray value range is 0 to 255, the number of pixels corresponding to each gray value in the front image of the offline map is counted in turn, for example 30 pixels with gray value 0, 30 pixels with gray value 1, 45 pixels with gray value 2, 60 pixels with gray value 3, and so on; representing these data in the form of a histogram gives the gray level histogram of the front image of the offline map.
And d, respectively normalizing the gray level histogram of the front image of the offline map and the gray level histogram of the front image of the user equipment to obtain a normalized histogram of the front image of the offline map and a normalized histogram of the front image of the user equipment.
By normalizing the gray level histogram, it can be understood that the number of pixels corresponding to each gray level value in the gray level histogram is divided by the total number of pixels in the image, so as to obtain the proportion of pixels corresponding to each gray level value in the image, in other words, the normalized gray level histogram reflects the proportion of pixels of each gray level value in the image.
Following the example of step c, the number of pixels corresponding to each gray value in the front image of the offline map is divided by the total number of pixels in that image to obtain its normalized histogram, which may include data such as 1% of pixels with gray value 0, 1% of pixels with gray value 1, 1.5% of pixels with gray value 2, 2% of pixels with gray value 3, and so on.
And e, calculating a correlation coefficient of the normalized gray level histogram of the front image of the offline map and the normalized gray level histogram of the front image of the user equipment, and determining a calculation result as the similarity of the front image of the user equipment and the front image of the offline map.
The correlation coefficient r of the two normalized gray level histograms can be calculated by the following formula (1):

$$r = \frac{\sum_{i=0}^{n}\left(X_i - X_{avg}\right)\left(Y_i - Y_{avg}\right)}{\sqrt{\sum_{i=0}^{n}\left(X_i - X_{avg}\right)^{2}\,\sum_{i=0}^{n}\left(Y_i - Y_{avg}\right)^{2}}} \tag{1}$$

where i represents a gray value, n is the upper limit of the gray value range (for example 255), X_i represents the proportion of pixels with gray value i in the front image of the offline map, X_avg represents the average of all X_i in the normalized gray histogram of the front image of the offline map, Y_i represents the proportion of pixels with gray value i in the front image of the user device, and Y_avg represents the average of all Y_i in the normalized gray histogram of the front image of the user device.
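As a non-limiting illustration, steps a to e and formula (1) can be sketched as follows; OpenCV is assumed only for resizing and gray-scale conversion, and color (BGR) input images are assumed.

```python
# Sketch of steps a-e: histogram-based similarity between a front image of
# the offline map and the front image of the user device, per formula (1).
import cv2
import numpy as np

def histogram_similarity(img_map, img_query):
    img_map = cv2.resize(img_map, (img_query.shape[1], img_query.shape[0]))  # step a
    g1 = cv2.cvtColor(img_map, cv2.COLOR_BGR2GRAY)                           # step b
    g2 = cv2.cvtColor(img_query, cv2.COLOR_BGR2GRAY)
    h1 = np.bincount(g1.ravel(), minlength=256) / g1.size                    # steps c-d
    h2 = np.bincount(g2.ravel(), minlength=256) / g2.size
    x, y = h1 - h1.mean(), h2 - h2.mean()                                    # step e
    return float((x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum()))
```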
S302b, retrieving to obtain a post candidate frame.
The post-candidate frames refer to several frames, among the rear images of the offline map, that are similar to the rear image captured by the user equipment in step S301b. In other words, step S302b amounts to retrieving, among the rear images of the offline map, several frames similar to the rear image of the user equipment.
Similar to step S302a, in step S302b the post-candidate frames may be retrieved according to whether the similarity is greater than a preset similarity threshold, or the N frames of offline-map rear images whose similarity to the rear image of the user equipment is highest may be selected, in descending order of similarity, as the post-candidate frames; this embodiment does not limit the specific implementation.
In step S302b, the algorithm for calculating the similarity between the post-image of the user device and the post-image of the offline map may be the same as or different from the algorithm for calculating the similarity in step S302 a.
For example, in step S302b the similarity between the rear image of the user equipment and a rear image of the offline map may also be calculated using the histogram method; for the specific calculation process, refer to step S302a, which is not repeated here.
It will be appreciated that different scenes correspond to different offline maps; for example, store A corresponds to an offline map of store A, and school B corresponds to an offline map of school B. Retrieving the pre-candidate frames and the post-candidate frames in steps S302a and S302b therefore means retrieving them in the offline map of the scene where the user equipment is currently located.
Therefore, before retrieving the candidate frames, the scene where the user equipment is located can be determined according to the positioning information (such as GPS positioning information) of the user equipment. For example, if the positioning information of the user equipment indicates that the user equipment is within the range of mall A, it is determined that the scene where the user equipment is currently located is mall A, and the candidate frames are retrieved in the offline map of mall A.
As an example, when the user equipment performs visual positioning in mall A, it uses the front camera and the rear camera to shoot respectively, obtaining a front image of mall A and a rear image of mall A; the front image of mall A is then used to retrieve several pre-candidate frames from the offline map of mall A, and the rear image of mall A is used to retrieve several post-candidate frames from the offline map of mall A.
S303, reordering the candidate frames.
Reordering the candidate frames means identifying the image combinations among the retrieved pre-candidate frames and post-candidate frames, then calculating the relative pose error of each image combination, and finally sorting the identified image combinations in ascending order of relative pose error, i.e., image combinations with smaller relative pose errors are ranked earlier.
As described above, when the capturing device captures images forming the offline map, the front camera and the rear camera of the capturing device are called to capture images at the same time, and accordingly, each frame of front image of the offline map corresponds to a rear image captured by the capturing device at the same time, and a frame of front image and a frame of rear image captured by the capturing device at the same time in the offline map can be recorded as an image combination in the offline map.
For example, at time T1, a frame of front image and a frame of rear image captured by the capturing device at the same time may be regarded as one image combination.
In some alternative embodiments, referring to fig. 4, the specific implementation of step S303 may include the following steps:
S501, image combinations in the pre-candidate frames and the post-candidate frames are identified.
Identifying image combinations may be accomplished by comparing the times of capture. That is, if the photographing time of one frame pre-candidate frame and the photographing time of one frame post-candidate frame are the same, or the deviation of the photographing times of both is within an acceptable range (for example, less than 0.5 seconds), this frame pre-candidate frame and this frame post-candidate frame may be regarded as one image combination. The capturing device can upload the shooting time of the image to the cloud end together when uploading the image to the cloud end.
Taking fig. 4 as an example, it is recognized that the photographing time of the front candidate frame 1 and the rear candidate frame 1 is the same, the two candidate frames are determined as one image combination and are marked as combination 1-1, the photographing time of the front candidate frame 2 and the rear candidate frame 2 is the same, and the two candidate frames are determined as one image combination and are marked as combination 2-2.
During retrieval of the pre-candidate frames and post-candidate frames, it may happen that a front image of the offline map is determined as a pre-candidate frame while the rear image captured at the same time is not determined as a post-candidate frame; in this case, that pre-candidate frame cannot form an image combination. Conversely, a rear image of the offline map may be determined as a post-candidate frame while the front image captured at the same time is not determined as a pre-candidate frame; in this case, that post-candidate frame cannot form an image combination.
In step S501, in the above case, the pre-candidate frame and the post-candidate frame that do not constitute the image combination may be deleted directly, that is, if there is a pre-candidate frame that does not constitute the image combination with the post-candidate frame, the corresponding pre-candidate frame may be deleted, and if there is a post-candidate frame that does not constitute the image combination with the pre-candidate frame, the corresponding post-candidate frame may be deleted.
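As a non-limiting illustration, step S501 can be sketched as follows: pre-candidate frames and post-candidate frames whose shooting times differ by less than the tolerance mentioned above (e.g. 0.5 seconds) are paired into image combinations, and candidates that form no combination are simply dropped. The frame fields are hypothetical.

```python
# Sketch of S501: pairing pre- and post-candidate frames captured by the
# acquisition device at (almost) the same time; unmatched candidates are
# implicitly deleted because they appear in no combination.
def identify_combinations(pre_candidates, post_candidates, tol=0.5):
    combos = []
    for pre in pre_candidates:
        for post in post_candidates:
            if abs(pre["timestamp"] - post["timestamp"]) < tol:
                combos.append((pre, post))
                break
    return combos
```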
S502, calculating relative pose.
In step S502, the relative pose of each image combination identified in S501 needs to be calculated. Taking fig. 4 as an example, in S501, the pre-candidate frame 1 and the post-candidate frame 1 are identified as one image combination, and the pre-candidate frame 2 and the post-candidate frame 2 are identified as one image combination, then in step S502, the relative pose of the pre-candidate frame 1 and the post-candidate frame 1 needs to be calculated, the calculation result is recorded as the relative pose 1-1, and the relative pose of the pre-candidate frame 2 and the post-candidate frame 2 is calculated, and the calculation result is recorded as the relative pose 2-2.
For ease of calculation, the relative pose of the image combination may correspond to the relative pose of the acquisition device. If the relative pose of the acquisition device is the relative pose of the front camera relative to the rear camera, the relative pose of the front candidate frame relative to the rear candidate frame in the image combination is calculated in S502; if the relative pose of the acquisition device is the relative pose of the rear camera with respect to the front camera, then in S502, the relative pose of the rear candidate frame with respect to the front candidate frame in the image combination is calculated.
The meaning of the relative pose of one image combination will be described below, taking the pre-candidate frame 1 and the post-candidate frame 1 in fig. 4 as an example, and the time when the acquisition device captures the pre-candidate frame 1 and the post-candidate frame 1 will be denoted as t.
Because the pose of the pre-candidate frame 1 and the pose of the post-candidate frame 1 are calculated by the server through three-dimensional reconstruction from the multi-frame images captured by the acquisition device, and in view of the concept of the pose of an image introduced above, the pose of the pre-candidate frame 1 can be regarded as an estimate of the pose of the front camera of the acquisition device at time t, and the pose of the post-candidate frame 1 can be regarded as an estimate of the pose of the rear camera of the acquisition device at time t.
Further, the relative pose of the front candidate frame 1 relative to the rear candidate frame 1 can be regarded as an estimated value of the pose of the front camera of the acquisition device under the camera coordinate system of the rear camera at the time t. Similarly, the relative pose of the rear candidate frame 1 relative to the front candidate frame 1 can be regarded as an estimated value of the pose of the rear camera of the acquisition device under the camera coordinate system of the front camera at the time t.
As described above, the pose can be represented by the translation matrix T and the rotation matrix R, so the pose of the front candidate frame 1 can be denoted as the rotation matrix Rfront-1 and the translation matrix Tfront-1, and the pose of the rear candidate frame 1 can be denoted as the rotation matrix Rback-1 and the translation matrix Tback-1, whereby the relative pose of the front candidate frame 1 with respect to the rear candidate frame 1 can be calculated by the following formula (2):
Pfb-1 = inv(Pback-1) · Pfront-1

Wherein Pfront-1 = [Rfront-1, Tfront-1; 0, 1] and Pback-1 = [Rback-1, Tback-1; 0, 1] are 4×4 homogeneous matrices, and Pfb-1 = [Rfb-1, Tfb-1; 0, 1]; Rfb-1 represents the rotation matrix of the pre-candidate frame 1 relative to the post-candidate frame 1, and Tfb-1 represents the translation matrix of the pre-candidate frame 1 relative to the post-candidate frame 1. inv denotes inverting the matrix in the brackets, Pback-1 is the extrinsic matrix of the post-candidate frame 1, Pfront-1 is the extrinsic matrix of the pre-candidate frame 1; for the extrinsic matrix, please refer to the foregoing description of the pose of the device.
That is, the relative pose of the pre-candidate frame 1 with respect to the post-candidate frame 1 is equal to the inverse of the extrinsic matrix of the post-candidate frame 1, multiplied by the extrinsic matrix of the pre-candidate frame 1.
With reference to the foregoing description of the concept of the offline map, the pose of the pre-candidate frame and the pose of the post-candidate frame used in the calculation process can be read from the offline map.
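As an illustrative aid (not part of the original embodiment), the following minimal Python sketch shows how formula (2) can be evaluated with 4×4 homogeneous extrinsic matrices; the function names and the use of numpy are assumptions made only for illustration.

```python
import numpy as np

def to_homogeneous(R, T):
    """Build the 4x4 extrinsic matrix [R, T; 0, 1] from a 3x3 rotation matrix and a translation vector."""
    P = np.eye(4)
    P[:3, :3] = R
    P[:3, 3] = np.asarray(T).reshape(3)
    return P

def relative_pose(R_front, T_front, R_back, T_back):
    """Relative pose of the front candidate frame with respect to the rear candidate frame,
    i.e. inv(Pback) @ Pfront as described by formula (2)."""
    P_fb = np.linalg.inv(to_homogeneous(R_back, T_back)) @ to_homogeneous(R_front, T_front)
    return P_fb[:3, :3], P_fb[:3, 3]   # Rfb, Tfb
```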
S503, calculating relative pose errors.
The relative pose error refers to the deviation between the relative pose of an image combination and the relative poses of a front camera and a rear camera of the acquisition device. Through step S503, a deviation of the relative pose of each image combination and the relative pose of the acquisition device may be calculated.
It should be noted that the relative pose of the capturing device in step S503 refers to the pre-calibrated relative pose of the front camera with respect to the rear camera on the capturing device, or the pre-calibrated pose of the rear camera with respect to the front camera.
Taking fig. 4 as an example, in step S503, the deviation of the relative pose 1-1 and the relative pose of the acquisition device is calculated, so as to obtain an error 1-1 of the image combination 1-1, and the deviation of the relative pose 2-2 and the relative pose of the acquisition device is calculated, so as to obtain an error 2-2 of the image combination 2-2.
The relative pose error may be calculated by a variety of methods, and the specific calculation method is not limited in this embodiment.
The following describes a method for calculating the relative pose error in combination with the foregoing example:
the relative pose of the image combination 1-1 is represented by a relative rotation matrix Rfb-1 and a relative translation matrix Tfb-1, and the relative pose of the front camera of the acquisition device relative to the rear camera is represented by a relative rotation matrix Rfb-gt and a relative translation matrix Tfb-gt, then the Angle error Angle-err can be calculated based on the following formula (3):
Angle-err = arccos( (tr(Rfb-1 · inv(Rfb-gt)) − 1) / 2 )

In equation (3), arccos is an inverse cosine function, inv denotes matrix inversion, and tr(·) represents the trace of a matrix; for example, tr(Rfb-gt) is the trace of the relative rotation matrix Rfb-gt.
The translation error T-err is calculated based on the following equation (4):

T-err = arccos( (Tfb-1 · Tfb-gt) / (‖Tfb-1‖ · ‖Tfb-gt‖) )

In other words, the translation error T-err is equal to the arccosine of the inner product of the relative translation matrix Tfb-1 and the relative translation matrix Tfb-gt divided by the product of the norm of Tfb-1 and the norm of Tfb-gt, that is, the angle between the two translation vectors; ‖·‖ denotes the norm (modulus) of a vector.
The sum of the translational error and the angular error can be regarded as the relative pose error of the image combination 1-1, denoted as P-err-1-1, namely:
P-err-1-1=Angle-err+T-err
S504, reordering the image combinations according to the relative pose errors.
Taking fig. 4 as an example, suppose that after the calculation in S503, error 1-1 is smaller than error 2-2; in the reordering, image combination 1-1 is therefore ranked before image combination 2-2.
It should be noted that, step S303 is an optional step, in some alternative embodiments, the candidate frames may not be reordered, and steps S304a and S304b may be directly performed after the pre-candidate frame and the post-candidate frame are retrieved.
The purpose of performing step S303 is to check the accuracy of the pose of the front candidate frame and the pose of the rear candidate frame in the offline map.
The pre-calibrated relative pose in the acquisition device, such as the pre-calibrated pose of the front camera under the camera coordinate system of the rear camera, can be considered to be accurate, and the front camera and the rear camera of the acquisition device are fixed, so that the pose of the front camera under the camera coordinate system of the rear camera obviously cannot change in the use process.
The front image and the rear image of the offline map are obtained by shooting by the acquisition equipment through the front camera and the rear camera of the acquisition equipment at the same time, namely, the pose of the front image and the pose of the rear image are in the same world coordinate system.
Therefore, for a front candidate frame and a rear candidate frame which are simultaneously shot by the acquisition equipment in the image combination, if the pose accuracy of the two candidate frames is higher when an offline map is constructed, the relative pose of the image combination, namely the estimated value of the pose of the front camera under the camera coordinate system of the rear camera, which is calculated according to the poses of the two candidate frames should be close to the relative pose calibrated in advance on the acquisition equipment, so the relative pose error should be smaller; otherwise, if the pose of the two candidate frames calculated during the construction of the offline map is inaccurate, the estimated value of the pose of the front camera under the camera coordinate system of the rear camera calculated by the pose calculation method and the calibrated accurate relative pose have larger deviation, namely the relative pose error is larger.
It can be seen that the benefit of performing step S303 is:
the magnitude of the relative pose error of an image combination can reflect the accuracy of the pose of the front candidate frame and the pose of the rear candidate frame in that image combination: the smaller the relative pose error, the higher the accuracy of the poses of the corresponding front candidate frame and rear candidate frame. By processing the candidate frames in order of relative pose error from small to large, the front candidate frames and rear candidate frames with more accurate poses can be used preferentially in the subsequent steps, thereby improving the accuracy of the finally determined pose of the user equipment.
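The reordering of S501 to S504 can be illustrated with the following hedged Python sketch; the angle and translation error terms follow the description of formulas (3) and (4) above, and the dictionary keys used to carry each image combination's relative pose are illustrative assumptions.

```python
import numpy as np

def rotation_angle_error(R_est, R_ref):
    """Geodesic angle (in radians) between two rotation matrices (cf. formula (3))."""
    cos_angle = (np.trace(R_est @ R_ref.T) - 1.0) / 2.0
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))

def translation_angle_error(T_est, T_ref):
    """Angle (in radians) between two translation vectors (cf. formula (4))."""
    cos_angle = np.dot(T_est, T_ref) / (np.linalg.norm(T_est) * np.linalg.norm(T_ref))
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))

def reorder_combinations(combinations, R_fb_gt, T_fb_gt):
    """Sort image combinations by relative pose error, smallest first.
    Each combination is assumed to carry its relative pose under the keys 'R_fb' and 'T_fb';
    R_fb_gt, T_fb_gt are the pre-calibrated relative pose of the acquisition device."""
    def relative_pose_error(combo):
        return (rotation_angle_error(combo["R_fb"], R_fb_gt)
                + translation_angle_error(combo["T_fb"], T_fb_gt))
    return sorted(combinations, key=relative_pose_error)
```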
And S304a, extracting image features of the pre-image and the pre-candidate frame.
In step S304a, the image features of the front image may be extracted, and the image features of the first M front candidate frames may be extracted sequentially in the order obtained in step S303. M is a preset positive integer; generally, M can be smaller than the total number of the retrieved pre-candidate frames.
For example, after sorting in step S303, there are 20 pre-candidate frames in total, in step S304a, only the image features of the first 14 pre-candidate frames may be extracted, and the last 6 pre-candidate frames may be directly removed.
As described in step S303, the poses of the candidate frames ranked later are less accurate. Removing the later-ranked pre-candidate frames in the above manner avoids calculating the pose of the user equipment with pre-candidate frames whose poses have low accuracy, thereby improving the accuracy of the determined pose of the user equipment.
And S304b, extracting image features of the post-image and the post-candidate frame.
Similar to step S304a, when step S304b is performed, the image features of the post-image may be extracted, and the image features of the first M post-candidate frames may be extracted sequentially in the order obtained in step S303. M is a preset positive integer; generally, M can be smaller than the total number of the retrieved post-candidate frames.

For example, after sorting in step S303, there are 20 post-candidate frames in total; in step S304b, only the image features of the first 14 post-candidate frames may be extracted, and the last 6 post-candidate frames may be directly removed.
The corresponding beneficial effects refer to step S304a, and are not described in detail.
In steps S304a and S304b, any image feature algorithm in the related technical fields such as image processing may be used to extract the image features of the specified image, or a conventional image processing method, for example, a Scale-invariant feature transform (SIFT) method may be used to extract the image features, or a method based on deep learning, for example, a pre-trained convolutional neural network may be used to extract the image features, and the specific algorithm for extracting the image features is not limited in this embodiment.
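As one concrete possibility for steps S304a and S304b, the SIFT option mentioned above can be implemented with OpenCV as sketched below; this is only an example, and a pre-trained deep-learning feature extractor could be used instead.

```python
import cv2

def extract_features(image_bgr):
    """Extract SIFT keypoints and descriptors from one image (one of the options named above)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    return keypoints, descriptors
```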
S305a, matching the image characteristics, and determining 2D-3D point pairs of the front image.
The step of matching the image features is to compare the image features of the pre-image extracted in the step S304a with the image features of the pre-candidate frame, so as to determine which pixel points in the pre-image and the pre-candidate frame are matched with each other.
For example, by comparing the image characteristics of the front image and the image characteristics of the front candidate frame, it can be determined that the pixel point A1 of the front image and the pixel point A1' of the front candidate frame match each other.
As an exemplary matching process, in step S305a, the similarity between the image feature of an area in the front image and the image feature of an area in the front candidate frame may be calculated; if the similarity between the two is greater than a preset similarity threshold, the pixel point at the center of the area in the front image and the pixel point at the center of the area in the front candidate frame may be considered to match each other.
Two pixel points in the two frames of images are matched with each other, namely, the two pixel points are projected by the same space point. As an example, assuming that the same cube image is displayed in both the front image and the front candidate frame, the pixel point projected by one vertex a of the cube in the front image is A1, and the pixel point projected by the vertex a of the cube in the front candidate frame is A1', in this case, the pixel point A1 of the front image and the pixel point A1' of the front candidate frame may be regarded as two mutually matched pixel points.
Determining the 2D-3D point pairs of the front image may be understood as determining, for a pixel point in the front image, which of the plurality of spatial points recorded in the offline map the pixel point is projected from, that is, determining the projection relationship of the pixel point of the front image; a pixel point of the front image together with the spatial point from which it is projected constitutes one 2D-3D point pair.
In view of the foregoing description of the concept of the offline map, the offline map includes the projection relationship of a plurality of pixels in the pre-candidate frame, that is, the coordinates of the spatial points that include the pixels that project the pre-candidate frame, so, as long as the pixels that match the pixels of the pre-candidate frame are found in the pre-image, the projection relationship of the pixels of the pre-candidate frame can be determined as the projection relationship of the pixels in the pre-image that match each other.
S305b, matching the image characteristics, and determining 2D-3D point pairs of the rear image.
Similarly to step S305a, in step S305b, by comparing the image features of the post-image with the image features of the post-candidate frame, it is determined which pixels in the post-image and the post-candidate frame match each other, so that the projection relationship of the pixels in the post-candidate frame is determined as the projection relationship of the pixels in the post-image that match each other.
The execution of step S305b is described below in conjunction with the example of fig. 5:
the post-candidate frame 501 is a frame of post-image in the offline map retrieved in the previous step, and the offline map records the projection relationship of the pixel point 511 in the post-candidate frame 501, that is, the pixel point 511 is the projection of the spatial point 510 with coordinates (X0, Y0, Z0).
S51, feature matching.
In executing step S305b, the image features of the post-candidate frame 501 are first compared with the image features of the post-image 502 captured by the user device 200, so as to find a pixel point in the post-image that matches the pixel point 511 of the post-candidate frame.
After comparison, it is determined that the pixel 512 of the post-image 502 matches the pixel 511 of the post-candidate frame, i.e., it is determined that the pixel 512 and the pixel 511 are projections of the same spatial point on different images.
S52, determining the pixel point 512 and the space point 510 as 2D-3D point pairs.
After determining that the pixel 512 of the post-image 502 matches the pixel 511 of the post-candidate frame, the projection relationship of the pixel 511 in the offline map may be determined as the projection relationship of the pixel 512 of the post-image 502, that is, the projection of the pixel 512 of the post-image is determined to be the spatial point 510 with coordinates (X0, Y0, Z0), that is, the pixel 512 and the spatial point 510 are determined to be a 2D-3D point pair of the post-image.
The execution of step S305a is similar to the example of fig. 5, and will not be described again.
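The matching and point-pair construction of steps S305a and S305b can be sketched as follows, assuming SIFT-style descriptors and a brute-force matcher with Lowe's ratio test; the data layout used for the offline map's projection relations (a dictionary from candidate-frame keypoint index to a 3D point) is an assumption made for illustration.

```python
import cv2
import numpy as np

def build_2d3d_pairs(desc_query, kp_query, desc_cand, cand_point3d):
    """Match a query image (front or rear image) against a candidate frame and transfer the
    candidate frame's projection relations, yielding 2D-3D point pairs for the query image.

    cand_point3d: mapping from a candidate-frame keypoint index to the (X, Y, Z) coordinates
                  of the spatial point it projects, as read from the offline map (assumed layout).
    """
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(desc_query, desc_cand, k=2)

    points_2d, points_3d = [], []
    for m, n in matches:
        # Lowe's ratio test, and keep only candidate-frame pixels with a known spatial point.
        if m.distance < 0.75 * n.distance and m.trainIdx in cand_point3d:
            points_2d.append(kp_query[m.queryIdx].pt)   # pixel point in the query image
            points_3d.append(cand_point3d[m.trainIdx])  # spatial point from the offline map
    return np.asarray(points_2d, dtype=np.float64), np.asarray(points_3d, dtype=np.float64)
```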
S306, calculating the pose of the user equipment by combining the 2D-3D point pairs of the front image and the rear image and the relative pose of the user equipment.
The relative pose of the user equipment refers to the relative pose between the front camera and the rear camera of the user equipment; it may be the relative pose of the front camera of the user equipment with respect to the rear camera, that is, the pose of the front camera of the user equipment in the camera coordinate system of the rear camera, or the relative pose of the rear camera of the user equipment with respect to the front camera. The following description takes the relative pose of the front camera of the user equipment with respect to the rear camera as an example.
Similar to the acquisition device, the relative pose of the user device belongs to a part of parameters of the user device calibrated in advance, that is, the manufacturer calibrates the relative pose of the user device by various testing means before the user device leaves the factory and writes the relative pose of the user device into the memory of the user device, and in step S306, the calibrated relative pose can be directly obtained from the user device.
After the 2D-3D point pairs of the front image are obtained, any device pose calculation algorithm in the related technical fields such as visual positioning may be used to calculate the pose of the front camera of the user equipment; similarly, the pose of the rear camera of the user equipment can be calculated with the same algorithm according to the 2D-3D point pairs of the rear image. As an example, the algorithm used may be a Perspective-n-Point (PnP) method, an Efficient Perspective-n-Point (EPnP) method, a Perspective-3-Point (P3P) method, and the like; this embodiment does not limit the algorithm specifically used.
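As a hedged illustration of one such algorithm (not necessarily the one used in this embodiment), OpenCV's RANSAC-wrapped PnP solver can compute a camera pose from the 2D-3D point pairs; the intrinsic matrix K is assumed to be known from calibration, and the function name is illustrative.

```python
import cv2
import numpy as np

def estimate_camera_pose(points_3d, points_2d, K, dist_coeffs=None):
    """Estimate one camera's pose from its 2D-3D point pairs using a RANSAC-wrapped PnP solver."""
    if dist_coeffs is None:
        dist_coeffs = np.zeros(5)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d, points_2d, K, dist_coeffs, flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        raise RuntimeError("PnP failed: not enough consistent 2D-3D point pairs")
    R, _ = cv2.Rodrigues(rvec)     # rotation matrix of the camera
    return R, tvec.reshape(3)      # pose (R, T) of the camera
```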
An alternative embodiment of step S306 is described below using the P3P algorithm as an example. Referring to fig. 6, the specific embodiment of step S306 may include the following steps:
S601, selecting a plurality of 2D-3D point pairs of the front image and the rear image for loading.
Since the algorithm used in the present embodiment is a P3P algorithm, in step S601, 3 unloaded 2D-3D point pairs of the front image and 3 unloaded 2D-3D point pairs of the rear image may be selected, respectively.
The 2D-3D point pairs of the front-facing image are used for calculating the pose estimation value of the front-facing camera of the user equipment, and the 2D-3D point pairs of the rear-facing image are used for calculating the pose estimation value of the rear-facing camera of the user equipment.
As an example, the 3 2D-3D point pairs of the post-image selected in step S601 may be noted as:
A1(A1u,A1v),A(Axw,Ayw,Azw);
B1(B1u,B1v),B(Bxw,Byw,Bzw);
C1(C1u,C1v),C(Cxw,Cyw,Czw);
wherein A1, B1 and C1 are three pixel points on the post-image, and their coordinates are pixel coordinates on the image; for example, A1(A1u, A1v) indicates that the pixel point A1 is located at column A1u and row A1v of the post-image. A, B and C are the spatial points from which the three pixel points are respectively projected, and their coordinates are coordinates in the world coordinate system.
S602, calculating coordinates of spatial points in the 2D-3D point pairs in a camera coordinate system of the corresponding camera.
In step S602, the 2D-3D point pairs of the front image and the 2D-3D point pairs of the rear image are processed separately. That is, the coordinates of the spatial points in the camera coordinate system of the front camera are calculated according to the 2D-3D point pairs of the front image, and the coordinates of the spatial points in the camera coordinate system of the rear camera are calculated according to the 2D-3D point pairs of the rear image.
A method of calculating coordinates of a spatial point in a camera coordinate system is described in connection with the example of step S601.
The origin of a camera coordinate system of the rear camera, namely the optical center of the rear camera is recorded as O, and the following equation set (5) can be determined according to the cosine law:
OA² + OB² − 2·OA·OB·cos<A,B> = AB²
OA² + OC² − 2·OA·OC·cos<A,C> = AC²
OB² + OC² − 2·OB·OC·cos<B,C> = BC²
in equation set (5), OA, OB and OC represent the distances from the origin O to the spatial points a, B and C in order, AB, AC and BC represent the distances between the spatial points a, B and C, cos < a, B > represents the cosine value of the angle AOB, cos < a, C > represents the cosine value of the angle AOC, and cos < B, C > represents the cosine value of the angle BOC.
Equation set (5) may be modified to equation set (6) below:
Formula (1): x² + y² − 2·x·y·cos<A,B> = u
Formula (2): x² + 1 − 2·x·cos<A,C> = w·u
Formula (3): y² + 1 − 2·y·cos<B,C> = v·u
Wherein x = OA/OC, y = OB/OC, u = AB²/OC², w = AC²/AB², and v = BC²/AB².
Simplifying equation set (6) may result in equation set (7) as follows:
(1 − w)·x² − w·y² − 2·x·cos<A,C> + 2·w·x·y·cos<A,B> + 1 = 0
(1 − v)·y² − v·x² − 2·y·cos<B,C> + 2·v·x·y·cos<A,B> + 1 = 0
in equation set (7), x and y are unknowns, and the other parameters w, v, cos<A,B>, cos<A,C> and cos<B,C> can be calculated according to the pixel coordinates of the pixel points A1, B1 and C1 in the loaded 2D-3D point pairs and the coordinates of the spatial points A, B and C in the world coordinate system, so that equation set (7) can be solved to obtain x and y.
Equation set (7) may be solved by any mechanized algorithm for a system of two quadratic equations in two unknowns in the related art; the specific solving method is not limited in this embodiment. Illustratively, equation set (7) may be solved by an elimination method to obtain x and y.
After x and y are obtained, u can be calculated by substituting x and y into formula (1) of equation set (6); OC can then be obtained from u, and OA and OB can be obtained by combining the definitions of x and y. That is, the distances from the spatial points A, B and C to the origin O of the camera coordinate system of the rear camera are obtained.
Since A1, B1 and C1 are respectively equivalent to the projections of A, B and C on the imaging plane of the rear camera, the three points O, A1 and A are collinear, the three points O, B1 and B are collinear, and the three points O, C1 and C are collinear. Therefore, on the basis of knowing the distances from the spatial points A, B and C to the origin O, the coordinates of the three points A, B and C in the camera coordinate system of the rear camera are easy to calculate.
In practical application, the parameters other than x and y in equation set (7) can be calculated according to the loaded 2D-3D point pairs; these parameters are then input into a pre-configured solution program for equation set (7) to solve for x and y, and the coordinates of the spatial points A, B and C in the 2D-3D point pairs in the camera coordinate system of the corresponding camera are then calculated.
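The following Python sketch illustrates the computation described in S602 for the rear image, under the assumption that the rear camera intrinsic matrix K is known; it solves equation set (7) numerically with a generic root finder rather than a dedicated elimination procedure, and a complete P3P implementation would additionally enumerate and disambiguate the multiple possible solutions.

```python
import numpy as np
from scipy.optimize import fsolve

def bearing(pixel, K):
    """Unit direction of a pixel point in the camera coordinate system, using the intrinsic matrix K."""
    d = np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    return d / np.linalg.norm(d)

def p3p_distances(pix_A1, pix_B1, pix_C1, A, B, C, K):
    """Solve equation set (7) numerically and return the distances OA, OB, OC."""
    a1, b1, c1 = bearing(pix_A1, K), bearing(pix_B1, K), bearing(pix_C1, K)
    cos_AB, cos_AC, cos_BC = a1 @ b1, a1 @ c1, b1 @ c1
    AB, AC, BC = np.linalg.norm(A - B), np.linalg.norm(A - C), np.linalg.norm(B - C)
    w, v = (AC / AB) ** 2, (BC / AB) ** 2          # w = AC^2/AB^2, v = BC^2/AB^2

    def equations(xy):
        x, y = xy
        f1 = (1 - w) * x**2 - w * y**2 - 2 * x * cos_AC + 2 * w * x * y * cos_AB + 1
        f2 = (1 - v) * y**2 - v * x**2 - 2 * y * cos_BC + 2 * v * x * y * cos_AB + 1
        return [f1, f2]

    x, y = fsolve(equations, [1.0, 1.0])           # x = OA/OC, y = OB/OC (one root only)
    u = x**2 + y**2 - 2 * x * y * cos_AB           # u = AB^2/OC^2, from formula (1) of set (6)
    OC = AB / np.sqrt(u)
    return x * OC, y * OC, OC                      # OA, OB, OC
```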
S603, calculating the pose estimated value of the user equipment according to the coordinates of the space point in the camera coordinate system and the coordinates of the space point in the world coordinate system.
For any spatial point, the coordinates (Xc, Yc, Zc) of the spatial point in the camera coordinate system of a camera, the coordinates (Xw, Yw, Zw) of the spatial point in the world coordinate system, and the pose of the camera in the world coordinate system satisfy the following relation:

[Xc, Yc, Zc]ᵀ = R · [Xw, Yw, Zw]ᵀ + T

wherein R is the rotation matrix of the camera, T is the translation matrix of the camera, and R and T are the pose of the camera in the world coordinate system.
Therefore, for the front camera of the user equipment, after the coordinates of the plurality of spatial points in the camera coordinate system of the front camera are obtained in S602, the pose estimation value of the front camera of the user equipment in the world coordinate system can be solved by using the above relation, and similarly, the pose estimation value of the rear camera of the user equipment in the world coordinate system can be solved by using the coordinates of the plurality of spatial points obtained in S602 in the camera coordinate system of the rear camera.
For convenience of explanation, the pose estimation value of the front camera of the user equipment may be represented by a rotation matrix estimation value Rfront-test and a translation matrix estimation value Tfront-test, and the pose estimation value of the rear camera of the user equipment may be represented by a rotation matrix estimation value Rback-test and a translation matrix estimation value Tback-test.
In some alternative embodiments, in executing S603, a plurality of sets of space points may be selected, where each set of space points may be used to obtain a pose estimation value of a user device, and then a pose estimation value with the smallest relative pose error is selected as the pose estimation value output in step S603.
Steps S602 and S603 may be regarded as specific implementation procedures of the P3P algorithm in this embodiment.
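Given the coordinates of the same spatial points in both the world coordinate system and a camera coordinate system, one standard way to recover R and T satisfying the relation in S603 is the SVD-based absolute orientation (Kabsch) method sketched below; this is an illustrative choice, not necessarily the exact solving procedure of this embodiment.

```python
import numpy as np

def pose_from_correspondences(points_world, points_camera):
    """Recover R, T such that points_camera ≈ R @ points_world + T (least squares, SVD/Kabsch).
    points_world and points_camera are (N, 3) arrays of the same spatial points in both frames."""
    cw = points_world.mean(axis=0)
    cc = points_camera.mean(axis=0)
    H = (points_world - cw).T @ (points_camera - cc)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against a reflection
    R = Vt.T @ D @ U.T
    T = cc - R @ cw
    return R, T
```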
S604, optimizing the pose estimation value of the user equipment by using a nonlinear optimization method.
A nonlinear optimization algorithm works as follows: an optimization function is set for the object to be optimized; each time the optimized object is obtained, the corresponding value of the optimization function is calculated based on it; if the value does not satisfy the set optimization end condition (for example, it is not smaller than, or not larger than, a set optimization threshold), a new optimized object is calculated, and the process loops in this way; when the optimization function satisfies the optimization end condition, the optimized object at that time is output.
In this embodiment, the optimization object is a pose estimation value of a front camera and a pose estimation value of a rear camera of the user equipment.
The optimization function then comprises at least the relative pose error of the user device. The relative pose error of the user equipment refers to the error between the calculated relative pose estimated value between the front camera and the rear camera of the user equipment and the pre-calibrated relative pose of the front camera and the rear camera in the user equipment according to the pose estimated value of the front camera and the pose estimated value of the rear camera.
The optimization end condition is to minimize the optimization function.
It can be seen that in step S604, the relative pose error of the user equipment is minimized; that is, the pose estimation values of the user equipment are continuously adjusted until the relative pose estimation value of the user equipment is as close as possible to the calibrated relative pose of the user equipment.
In other words, in the process of optimizing the pose estimation value of the user equipment in this embodiment, the calibrated relative pose of the user equipment is introduced as the constraint condition for optimization.
The relative pose estimation value between the front camera and the rear camera of the user equipment can be represented by a relative rotation matrix estimation value Rfb-test and a relative translation matrix estimation value Tfb-test; correspondingly, the calibrated relative pose of the user equipment can be represented by the relative rotation matrix Rfb-user of the user equipment and the relative translation matrix Tfb-user of the user equipment.
The relative pose error of the user device may include an angle error of the user device and a pan error of the user device.
Wherein, the angle error Angle-u-err of the user equipment can be calculated according to the following formula (8):

Angle-u-err = arccos( (tr(Rfb-test · inv(Rfb-user)) − 1) / 2 )

In formula (8), inv denotes matrix inversion, tr(·) denotes the trace of a matrix, and Rfb-user is the relative rotation matrix of the user equipment.
The translation error T-u-err of the user equipment can be calculated according to the following formula (9):

T-u-err = arccos( (Tfb-test · Tfb-user) / (‖Tfb-test‖ · ‖Tfb-user‖) )

The meaning of each symbol in formulas (8) and (9) can be seen from the foregoing formulas (3) and (4).
In some optional examples, the relative pose estimation value between the front and rear cameras of the user equipment may be the estimated pose of the front camera of the user equipment in the camera coordinate system of the rear camera, and the relative pose calibrated in the user equipment may be the calibrated pose of the front camera of the user equipment in the camera coordinate system of the rear camera.
In this example, the relative pose estimation value may be calculated according to the following formula (10):

Pfb-test = inv(Pback-test) · Pfront-test

Wherein Pfront-test = [Rfront-test, Tfront-test; 0, 1], Pback-test = [Rback-test, Tback-test; 0, 1], and Pfb-test = [Rfb-test, Tfb-test; 0, 1].

The meaning of the symbols in formula (10) can be referred to formula (2) and is not described in detail.
In still other optional examples, the relative pose estimation value between the front and rear cameras of the user equipment may also be the estimated pose of the rear camera of the user equipment in the camera coordinate system of the front camera, and the relative pose calibrated in the user equipment may be the calibrated pose of the rear camera of the user equipment in the camera coordinate system of the front camera. In this case, the calculation method may likewise refer to formula (10) and is not repeated.
In step S604, the optimization function may be a sum of an angle error and a translation error, or may be a sum of an angle error, a translation error, and other optional errors, which are not limited in this embodiment.
The nonlinear optimization method in step S604 may be any nonlinear optimization method in the related art, and the nonlinear optimization method specifically used in this embodiment is not limited.
As an example, the pose estimation value of the user equipment calculated in step S603, that is, the pose estimation value of the front camera and the pose estimation value of the rear camera of the user equipment, may be optimized in step S604 based on the Bundle Adjustment (BA) method.
An alternative implementation of step S604 is illustrated below by the BA method:
Step a, obtaining the pose estimation value of the front camera and the pose estimation value of the rear camera of the user equipment calculated in step S603.
And b, respectively calculating a translation error T-u-err of the user equipment, an Angle error Angle-u-err of the user equipment, a reprojection error (which can be recorded as Re-front-err) of the front image and a reprojection error (which can be recorded as Re-back-err) of the rear image.
The calculation method of the angle error and the translation error of the ue is shown in the formulas (8) and (9).
And c, determining whether the optimization function meets the optimization ending condition.
The optimization function, namely the optimization function Target, is the sum of the translation error of the user equipment, the angle error of the user equipment, the reprojection error of the front image and the reprojection error of the rear image calculated in step b:

Target = Re-back-err + Re-front-err + Angle-u-err + T-u-err
The reprojection error is explained here. According to the current camera pose estimation value, the spatial point in a 2D-3D point pair is projected again onto the imaging plane of the camera, that is, projected again into the corresponding image, to obtain a reprojected pixel point; the deviation between the coordinates of the reprojected pixel point in the image and the coordinates of the pixel point in the 2D-3D point pair is the reprojection error under the current camera pose estimation value.
In combination with the previous example, the spatial points A, B and C in the point pairs (A1, A), (B1, B), (C1, C) are re-projected onto the rear image according to the pose estimation value of the rear camera, so as to obtain corresponding re-projected pixel points A1', B1' and C1'. The sum of the deviations of the coordinates of A1' and A1, the deviations of the coordinates of B1' and B1, and the deviations of the coordinates of C1' and C1 can be regarded as the re-projection error of the rear camera.
As previously described, the optimization end condition is to minimize the optimization function.
In some alternative embodiments, the optimization end condition may specifically be represented by that the value of the optimization function calculated in step c is smaller than a preset optimization threshold.
In other alternative embodiments, the optimization ending condition may specifically be represented by a value of the optimization function calculated when the step c is executed this time and a value of the optimization function calculated when the step c is executed last time, where a deviation between the values is smaller than a preset optimization threshold.
The specific optimization termination condition to be used may be determined according to practical situations, and this embodiment is not limited thereto.
Step d, step S601 to S603 are executed again, and step a is executed again after obtaining the pose estimation value of the new front camera and the pose estimation value of the new rear camera.
In step D, when steps S601 to S603 are performed again, a 2D-3D point pair different from that of the previous execution may be loaded, thereby calculating a new pose estimation value of the front and rear cameras.
That is, in step S305, by comparing the image features, a plurality of 2D-3D point pairs of the front image and a plurality of 2D-3D point pairs of the rear image may be obtained by matching, and each time steps S601 to S603 are performed, only a part of the 2D-3D point pairs may be loaded, and the next time steps S601 to S603 are performed, another 2D-3D point pair may be loaded.
In combination with the foregoing examples, when steps S601 to S603 are performed for the first time, the pose estimation value of the rear camera of the user equipment is calculated according to the point pairs (A1, A), (B1, B), (C1, C) of the rear image; when steps S601 to S603 are performed again in step d, different 2D-3D point pairs may be loaded, for example, (D1, D), (E1, E), (F1, F) of the rear image, so as to calculate a new pose estimation value of the rear camera of the user equipment.
And e, ending the optimization method.
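Steps a to e can be condensed into the following hedged Python sketch, which parametrizes each camera pose with a Rodrigues rotation vector and a translation vector and lets a generic least-squares solver minimize the reprojection errors of both images together with the relative pose error against the calibrated relative pose; the equal weighting of pixel and radian residuals, and the use of scipy in place of the explicit point-pair re-selection loop of step d, are simplifying assumptions.

```python
import cv2
import numpy as np
from scipy.optimize import least_squares

def refine_poses(pts3d_f, pts2d_f, pts3d_b, pts2d_b, K_f, K_b,
                 rvec_f0, tvec_f0, rvec_b0, tvec_b0, R_fb_calib, T_fb_calib):
    """Jointly refine the front/rear pose estimates by minimizing the reprojection errors of both
    images plus the deviation of the estimated relative pose from the calibrated relative pose."""
    zeros = np.zeros(5)  # assume undistorted (or pre-undistorted) pixel coordinates

    def residuals(params):
        rvec_f, tvec_f = params[0:3], params[3:6]
        rvec_b, tvec_b = params[6:9], params[9:12]
        proj_f, _ = cv2.projectPoints(pts3d_f, rvec_f, tvec_f, K_f, zeros)
        proj_b, _ = cv2.projectPoints(pts3d_b, rvec_b, tvec_b, K_b, zeros)
        re_f = (proj_f.reshape(-1, 2) - pts2d_f).ravel()   # Re-front-err terms (pixels)
        re_b = (proj_b.reshape(-1, 2) - pts2d_b).ravel()   # Re-back-err terms (pixels)

        Rf, _ = cv2.Rodrigues(rvec_f)
        Rb, _ = cv2.Rodrigues(rvec_b)
        # Relative pose estimate, composed as in formula (10): inv(P_back) @ P_front.
        R_fb = Rb.T @ Rf
        T_fb = Rb.T @ (tvec_f - tvec_b)
        ang = np.arccos(np.clip((np.trace(R_fb @ R_fb_calib.T) - 1) / 2, -1, 1))   # Angle-u-err
        trans = np.arccos(np.clip(np.dot(T_fb, T_fb_calib) /
                                  (np.linalg.norm(T_fb) * np.linalg.norm(T_fb_calib)), -1, 1))
        # In practice the pixel and radian terms would be weighted against each other.
        return np.concatenate([re_f, re_b, [ang, trans]])

    x0 = np.concatenate([rvec_f0, tvec_f0, rvec_b0, tvec_b0]).astype(np.float64)
    res = least_squares(residuals, x0)
    rf, tf = cv2.Rodrigues(res.x[0:3])[0], res.x[3:6]
    rb, tb = cv2.Rodrigues(res.x[6:9])[0], res.x[9:12]
    return rf, tf, rb, tb
```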
S605, determining the pose estimation value of the user equipment as the pose of the user equipment.
After the optimization ends, the current pose estimation value of the user equipment can be regarded as the pose of the user equipment, that is, the pose of the front camera and the pose of the rear camera of the user equipment when the front image and the rear image were shot.
The visual positioning method provided by the embodiment of the application has the following beneficial effects:
in the first aspect, when the user equipment performs visual positioning, the front camera and the rear camera are used for shooting the front image and the rear image at the same time, so that the field of view of the user equipment during visual positioning is enlarged, and the user equipment obtains more image information in a scene where the user equipment is located.
In the second aspect, the accuracy of the visual positioning result is improved in scenes lacking significant image features, such as indoor white-wall scenes and outdoor seaside scenes. In such a scene, only a small number of 2D-3D point pairs can be determined in the image shot by the user equipment by extracting and comparing image features, and too small a number of 2D-3D point pairs has a negative effect on the accuracy of the visual positioning result. In this embodiment, compared with shooting only the rear image, shooting the front image and the rear image at the same time obviously increases the number of determined 2D-3D point pairs, thereby helping to improve the accuracy of the visual positioning result.
The visual positioning result is equivalent to the pose of the user equipment determined in step S605, that is, the pose of the front camera and the pose of the rear camera of the user equipment.
In the third aspect, the introduction of the calibrated relative pose of the user equipment as a constraint condition for visual positioning (see step S604) also helps to improve the accuracy of the visual positioning result. Similar to the acquisition equipment, for a front image and a rear image which are shot by the front camera and the rear camera of the user equipment at the same time, if the pose estimation value of the front camera and the pose estimation value of the rear camera which are calculated based on the two frames of images are consistent with the real pose of the front camera and the rear camera during shooting, the relative pose estimation value calculated by the pose estimation value of the front camera and the pose estimation value of the rear camera also should be consistent with the real relative pose calibrated in the user equipment in advance. It can be seen that the closer the relative pose estimation value is to the relative pose calibrated by the user equipment, the more accurate the calculated pose estimation value of the front and rear cameras, whereas the more the relative pose estimation value is deviated from the relative pose calibrated by the user equipment, the more inaccurate the calculated pose estimation value of the front and rear cameras. Therefore, the constraint of the relative pose of the user equipment is introduced on the basis of the traditional visual positioning method, which is helpful to strengthen constraint conditions, thereby improving the accuracy of the visual positioning result.
The visual positioning method provided by this embodiment can be executed in real time after the user equipment starts a function requiring visual positioning. In combination with the example of fig. 2, after the user equipment starts the AR function, a rear image and a front image are obtained by real-time shooting with the rear camera and the front camera, wherein the rear image is displayed on the screen and the front image is not displayed on the screen. Each time shooting is completed, the user equipment invokes the visual positioning method shown in fig. 3 based on the rear image and the front image obtained by shooting, so as to obtain the pose of the user equipment during this shooting, and then determines the position of the specific object image on the screen according to the pose of the rear camera.
The real-time execution of the visual positioning method has the advantages that the current pose of the user equipment is determined in real time, the accurate realization of relevant functions requiring visual positioning on the user equipment is ensured, and the use experience of the user is improved.
In some alternative embodiments, the determination of the pose of the user equipment may also be performed at intervals, for example, every 2 seconds. In combination with the example of fig. 2, after the AR function of the user equipment is turned on, the user equipment shoots with the rear camera and the front camera simultaneously at intervals of 2 seconds to obtain a rear image and a front image, and then invokes the visual positioning method based on the rear image and the front image obtained by shooting to determine the pose of the user equipment at this shooting; in the meantime, the user equipment can determine the position of the physical image on the screen according to the pose of the user equipment determined last time.
The visual positioning method is performed at intervals of a certain time, so that the power consumption of the user equipment can be reduced, and the service time of the user equipment can be prolonged.
The above embodiments are only some optional embodiments of the visual positioning method of the present application, and the embodiments may be adjusted as required in practical applications, and the specific embodiments of the visual positioning method of the present application are not limited.
The visual positioning method provided by the application can be mainly executed by a software platform (such as a cloud) adopting an application program virtualization technology, the user equipment triggers the cloud to execute the visual positioning method in a mode of sending a request to the cloud, and after the cloud execution is finished, the calculated pose of the user equipment is fed back to the user equipment.
Fig. 7 is an interaction schematic diagram of a user device and a cloud end according to an embodiment of the present application.
S701, the user device photographs the front image and the rear image.
The implementation of step S701 may refer to steps S301a and S301b in the embodiment shown in fig. 3, and will not be described herein.
S702, searching candidate frames by the cloud.
In S702, the user equipment may upload the captured front image and the captured rear image to the cloud end, and the cloud end executes step S702 according to the uploaded front image and rear image.
The specific implementation of step S702 may refer to steps S302a and S302b of the embodiment shown in fig. 3, which are not described herein.
As previously described, when retrieving candidate frames, the offline map of the scene in which the user equipment is currently located needs to be found. In this embodiment, when the user equipment uploads the front image and the rear image to the cloud, it can also upload its own positioning information to the cloud; the cloud can then determine the scene where the user equipment is currently located according to the positioning information, so as to find the offline map of that scene from a plurality of offline maps. For example, if the user equipment performs visual positioning in mall A, the cloud can find the offline map of mall A according to the positioning information.
And S703, reordering the candidate frames by the cloud end.
In step S703, the cloud end may determine a plurality of image combinations according to whether the photographing time of the front candidate frame and the rear candidate frame are the same, then read the pose of each front candidate frame and each rear candidate frame from the offline map, and read the relative pose between the front camera and the rear camera of the acquisition device, calculate the relative pose error of each image combination based on these data, and finally order the candidate frames according to the relative pose error from small to large, that is, order the candidate frames in the image combination with smaller relative pose error to front, and order the candidate frames in the image combination with larger relative pose error to back.
The specific implementation process of step S703 may refer to step S303 in the embodiment shown in fig. 3, which is not described herein.
S704, extracting image features by the cloud.
The cloud invokes an image feature extraction algorithm to extract image features of the front image, the front candidate frame, the rear image and the rear candidate frame, respectively, where the image feature extraction algorithm may be a SIFT method or other methods, and this embodiment is not limited thereto.
The specific implementation of step S704 may be referred to as steps S304a and S304b in the embodiment shown in fig. 3.
S705, the cloud matches the image features to determine 2D-3D point pairs.
The cloud end compares the image characteristics of the front-end image with the image characteristics of the front-end candidate frame so as to determine the pixel points matched with each other in the front-end image and the front-end candidate frame, and then determines the projection relationship of the pixel points in the front-end candidate frame as the projection relationship of the matched pixel points in the front-end image, thereby obtaining the 2D-3D point pair of the front-end image.
Similarly, the cloud end compares the image characteristics of the rear-mounted image with the image characteristics of the rear-mounted candidate frame, so that mutually matched pixel points in the rear-mounted image and the rear-mounted candidate frame are determined, then the projection relationship of the pixel points in the rear-mounted candidate frame is determined as the projection relationship of the matched pixel points in the rear-mounted image, and therefore the 2D-3D point pair of the rear-mounted image is obtained.
The specific implementation of step S705 can be seen in steps S305a and S305b of the embodiment shown in fig. 3.
S706, the cloud calculates the pose of the user equipment according to the 2D-3D point pairs.
The cloud calculates the pose estimation value of the front camera of the user equipment according to the 2D-3D point pairs of the front image, calculates the pose estimation value of the rear camera of the user equipment according to the 2D-3D point pairs of the rear image, then optimizes the two pose estimation values with minimizing the relative pose error of the user equipment as the target, and finally determines the pose of the user equipment when the front image and the rear image were shot, that is, the pose of the front camera and the pose of the rear camera of the user equipment at that moment.
The specific implementation of step S706 may refer to step S306 of the embodiment shown in fig. 3, which is not described herein.
Referring to fig. 7, after the cloud calculates the pose of the user device, the pose of the user device is fed back to the user device, and specifically, the cloud may feed back the pose of the front camera and the pose of the rear camera of the user device to the user device. Thus, the user equipment obtains its own pose through the visual positioning method of the embodiment.
This has the advantages that, on the one hand, the user equipment does not need to store the offline map, which saves the storage space of the user equipment; on the other hand, the cloud has a higher processing speed than the user equipment, so the pose of the user equipment can be calculated more quickly, improving the efficiency of the visual positioning method.
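The device-cloud interaction of fig. 7 could, for example, be carried over a simple HTTP interface as sketched below; the endpoint name and the request/response fields are purely hypothetical, since this embodiment does not specify the transport protocol.

```python
import requests  # assumed HTTP transport; the embodiment does not specify the protocol

def request_visual_positioning(server_url, front_image_bytes, rear_image_bytes, location):
    """Upload the front/rear images plus coarse positioning info and receive the computed poses.
    The endpoint name and the request/response fields are illustrative assumptions."""
    files = {
        "front_image": ("front.jpg", front_image_bytes, "image/jpeg"),
        "rear_image": ("rear.jpg", rear_image_bytes, "image/jpeg"),
    }
    data = {"longitude": location[0], "latitude": location[1]}
    resp = requests.post(f"{server_url}/visual_positioning", files=files, data=data, timeout=10)
    resp.raise_for_status()
    result = resp.json()
    return result["front_camera_pose"], result["rear_camera_pose"]
```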
The present application does not limit the execution subject of each step in the visual positioning method. In some alternative embodiments, the visual positioning method may also be performed entirely by the user equipment. Specifically, on the premise that the performance of the user equipment allows, the user equipment may download the offline map of the scene where it is located from the cloud in advance. For example, when entering mall A, the user equipment sends its positioning information to the cloud in advance, and the cloud finds the offline map of the corresponding scene, that is, the offline map of mall A, according to the positioning information; the user equipment can then download the offline map of mall A from the cloud and execute each step in the embodiment shown in fig. 3 according to the offline map of mall A, so as to obtain its own pose.
The method has the advantages that the influence of network quality on the visual positioning process can be avoided, and the pose of the user equipment can be successfully determined by the visual positioning method even if the network quality of the user equipment is poor or no network connection exists in the corresponding scene.
According to the above user interface and the corresponding implementation process, a visual positioning method may be obtained, referring to fig. 8, and the visual positioning method of the present embodiment may include the following steps:
S801, a front image and a rear image are obtained.
The front image is an image shot by a front camera of the first device, and the rear image is an image shot by a rear camera of the first device.
The first device may be understood as a user device in the embodiment shown in fig. 3, for example a smart phone used by a user entering a mall.
Specific implementation of step S801 can be seen from steps S301a and S301b of the example shown in fig. 3.
S802, searching and obtaining a front candidate frame similar to the front image and a rear candidate frame similar to the rear image in the offline map.
The offline map includes a plurality of frames of images captured by the second device. The second device may be understood as the acquisition device in the embodiment shown in fig. 3.
Specific implementation of S802 may refer to steps S302a and S302b of the embodiment shown in fig. 3.
S803, determining a point pair of the front image according to the front candidate frame, and determining a point pair of the rear image according to the rear candidate frame.
The point pair of the front image comprises the pixel point of the front image and the space point of the pixel point of the projected front image, and the point pair of the rear image comprises the pixel point of the rear image and the space point of the pixel point of the projected rear image.
The point pair of the front image in step S803 is the 2D-3D point pair of the front image in the embodiment shown in fig. 3, and the point pair of the rear image is the 2D-3D point pair of the rear image in the embodiment shown in fig. 3.
The implementation process of S803 can be seen in steps S304a, S304b, S305a and S305b of the embodiment shown in fig. 3.
In some alternative embodiments, after S803 is performed, before S804 is performed, the pre-candidate frame and the post-candidate frame may be reordered, that is:
identifying an image combination in the front candidate frame and the rear candidate frame, wherein the image combination comprises a front candidate frame and a rear candidate frame which are shot by the second equipment at the same time;
for each image combination, calculating a relative pose error of the image combination, wherein the relative pose error of the image combination is an error between the relative pose of the front candidate frame and the rear candidate frame of the image combination and the relative pose of the second device, and the relative pose of the second device is the pre-calibrated relative pose of the front camera and the rear camera of the second device;
the image combinations are reordered in order of the corresponding relative pose errors from small to large.
For the implementation of the above reordering step, reference may be made to step S303 of the embodiment shown in fig. 3.
S804, determining the pose of the first device according to the point pairs of the front image, the point pairs of the rear image and the relative pose of the first device.
The pose of the first device corresponds to the pose of the front camera of the user device and the pose of the rear camera of the user device in the embodiment shown in fig. 3.
The relative pose of the first device is the relative pose of the front camera of the first device and the rear camera of the first device.
The specific implementation process of step S804 may refer to step S306 of the embodiment shown in fig. 3.
The beneficial effects of the present embodiment can be seen in the foregoing embodiments, and will not be described in detail.
An embodiment of the application provides an electronic device including a memory and one or more processors.
The memory is used for storing a computer program.
The one or more processors are configured to execute a computer program, in particular to implement the method of visual localization provided by any of the embodiments of the present application.
The electronic device may be a user device, such as a smart phone used by a user, or may be a cloud server.
The embodiment of the application also provides a computer storage medium for storing a computer program, which is specifically used for realizing the visual positioning method provided by any embodiment of the application when the computer program is executed.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
In the embodiments of the present application, "a plurality of" means two or more. It should be noted that, in the description of the embodiments of the present application, the terms "first", "second" and the like are used only for distinguishing between descriptions and are not to be understood as indicating or implying relative importance or a sequential order.

Claims (8)

1. A method of visual localization comprising:
Acquiring a front image and a rear image, wherein the front image is an image shot by a front camera of first equipment, and the rear image is an image shot by a rear camera of the first equipment;
searching for a prepositive candidate frame similar to the prepositive image and a postpositive candidate frame similar to the postpositive image in an offline map, wherein the offline map comprises multi-frame images shot by a second device;
identifying an image combination in the pre-candidate frame and the post-candidate frame, wherein the image combination comprises a frame of the pre-candidate frame and a frame of the post-candidate frame which are shot by the second equipment at the same time;
calculating relative pose errors of the image combinations for each image combination, wherein the relative pose errors of the image combinations are errors of relative poses of the front candidate frame and the rear candidate frame of the image combinations and relative poses of the second equipment, and the relative poses of the second equipment are pre-calibrated relative poses of a front camera and a rear camera of the second equipment;
reordering the image combinations in the order of the corresponding relative pose errors from small to large;
Determining a point pair of the front image according to the front N reordered candidate frames, and determining a point pair of the rear image according to the front N reordered candidate frames, wherein N is a preset positive integer, the point pair of the front image comprises a pixel point of the front image and a space point of a pixel point of the front image projected, and the point pair of the rear image comprises a pixel point of the rear image and a space point of a pixel point of the rear image projected;
and determining the pose of the first device according to the point pairs of the front image, the point pairs of the rear image and the relative pose of the first device, wherein the relative pose of the first device is the relative pose of the front camera of the first device and the rear camera of the first device.
2. The method of claim 1, wherein the looking up in the offline map for pre-candidate frames that are similar to the pre-image and post-candidate frames that are similar to the post-image comprises:
determining an image, of images shot by a front-facing camera of the second device, of which the similarity with the front-facing image is larger than a preset similarity threshold value as a front-facing candidate frame;
And determining an image, of images shot by the rear camera of the second device, and the image, of which the similarity with the rear image is larger than a preset similarity threshold, as a rear candidate frame.
3. The method of claim 1, wherein after identifying the image combinations in the pre-candidate frame and the post-candidate frame, further comprising:
and deleting the pre-candidate frames and the post-candidate frames which do not form the image combination.
4. A method according to claim 3, wherein said determining the point pairs of the pre-image from the top N re-ordered pre-candidate frames comprises:
extracting image features of the front candidate frames and image features of the front images;
determining the pixel points of the front candidate frame matched with the pixel points of the front image by comparing the image characteristics of the front candidate frame with the image characteristics of the front image;
and determining the space point projected out of the pixel points of the pre-candidate frame and the pixel points of the pre-image matched with the pixel points of the pre-candidate frame as the point pair of the pre-image.
5. The method of claim 1, wherein the determining the pose of the first device from the point pairs of the front image, the point pairs of the rear image, and the relative pose of the first device comprises:
Calculating according to the point pairs of the front-facing images to obtain pose estimated values of the front-facing cameras of the first equipment;
calculating according to the point pairs of the rear image to obtain a pose estimated value of a rear camera of the first device;
optimizing the pose estimation value of the front camera of the first equipment and the pose estimation value of the rear camera of the first equipment based on a nonlinear optimization method to obtain the pose of the first equipment, wherein an optimization function of the nonlinear optimization method at least comprises the relative pose error of the first equipment;
the relative pose error of the first device is an error between the relative pose estimated value of the first device, which is calculated according to the pose estimated value of the front camera of the first device and the pose estimated value of the rear camera of the first device, and the pre-calibrated relative pose of the first device, which is the relative pose between the front camera and the rear camera of the first device.
6. The method of claim 5, wherein the optimization function of the nonlinear optimization method comprises a relative pose error of the first device, a re-projection error of the front image, and a re-projection error of the rear image.
7. An electronic device, comprising a memory and one or more processors, wherein:
the memory is configured to store a computer program;
and the one or more processors are configured to execute the computer program so as to implement the visual positioning method according to any one of claims 1 to 6.
8. A computer storage medium storing a computer program which, when executed, implements the visual positioning method according to any one of claims 1 to 6.
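For readers who want to see how the claimed steps fit together, the following is a minimal Python sketch of one possible implementation of the method of claims 1 to 6. It is illustrative only and is not taken from the patent: it assumes OpenCV ORB features with brute-force Hamming matching for the pixel matching of claim 4, PnP-RANSAC for the per-camera pose estimates of claim 5, a grayscale-histogram correlation as a stand-in for the image-similarity score of claim 2, and a SciPy least-squares refinement whose residual stacks the two re-projection errors with the relative pose error of claims 5 and 6. All function names, thresholds, and the weight w_rel are assumptions.

```python
# Illustrative sketch only: ORB + brute-force matching, PnP-RANSAC, and a SciPy
# least-squares refinement; none of these concrete choices come from the patent.
import cv2
import numpy as np
from scipy.optimize import least_squares


def retrieve_candidates(query_img, map_images, sim_threshold=0.8):
    """Claim 2 (stand-in similarity): keep map frames whose grayscale-histogram
    correlation with the query image exceeds a preset threshold."""
    def hist(img):
        h = cv2.calcHist([img], [0], None, [64], [0, 256])
        return cv2.normalize(h, h).flatten()
    hq = hist(query_img)
    return [(i, img) for i, img in enumerate(map_images)
            if cv2.compareHist(hq, hist(img), cv2.HISTCMP_CORREL) > sim_threshold]


def match_point_pairs(query_img, cand_des, cand_points3d, orb, matcher):
    """Claim 4: match query pixels against a candidate frame's stored descriptors,
    then pair each matched query pixel with the 3D map point observed by the
    corresponding candidate keypoint (cand_points3d maps descriptor index -> 3D point)."""
    kp_q, des_q = orb.detectAndCompute(query_img, None)
    pixels, points3d = [], []
    for m in matcher.match(des_q, cand_des):
        p3d = cand_points3d.get(m.trainIdx)  # 3D point seen by the candidate pixel, if any
        if p3d is not None:
            pixels.append(kp_q[m.queryIdx].pt)
            points3d.append(p3d)
    return np.float32(pixels), np.float32(points3d)


def pnp_pose(pixels, points3d, K):
    """Claim 5: per-camera pose estimate (rvec, tvec) from 2D-3D point pairs."""
    ok, rvec, tvec, _ = cv2.solvePnPRansac(points3d, pixels, K, None)
    if not ok:
        raise RuntimeError("PnP failed")
    return rvec.ravel(), tvec.ravel()


def joint_refine(front_pairs, rear_pairs, K_f, K_r, pose_f0, pose_r0, T_rel_calib, w_rel=100.0):
    """Claims 5-6: refine both camera poses; the cost stacks the two re-projection
    errors and the deviation of the estimated front/rear relative pose from the
    pre-calibrated one (the relative pose error)."""
    def to_T(rvec, tvec):
        R, _ = cv2.Rodrigues(np.asarray(rvec, dtype=np.float64))
        T = np.eye(4)
        T[:3, :3], T[:3, 3] = R, tvec
        return T

    def residuals(x):
        rf, tf, rr, tr = x[0:3], x[3:6], x[6:9], x[9:12]
        res = []
        for pix, p3d, K, rvec, tvec in ((*front_pairs, K_f, rf, tf),
                                        (*rear_pairs, K_r, rr, tr)):
            proj, _ = cv2.projectPoints(p3d, rvec, tvec, K, None)
            res.append((proj.reshape(-1, 2) - pix).ravel())           # re-projection error
        T_rel_est = to_T(rr, tr) @ np.linalg.inv(to_T(rf, tf))        # estimated rear-from-front
        delta = np.linalg.inv(T_rel_calib) @ T_rel_est                # deviation from calibration
        rot_err, _ = cv2.Rodrigues(delta[:3, :3])
        res.append(w_rel * np.concatenate([rot_err.ravel(), delta[:3, 3]]))  # relative pose error
        return np.concatenate(res)

    x0 = np.concatenate([*pose_f0, *pose_r0])
    sol = least_squares(residuals, x0)
    return sol.x[0:6], sol.x[6:12]  # refined front and rear poses (rvec, tvec each)
```

Typical inputs would be orb = cv2.ORB_create() and matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True), with cand_des and cand_points3d taken from the offline map; these, like T_rel_calib and w_rel, are assumed placeholders. The design point the sketch mirrors is that the pre-calibrated front/rear extrinsics enter the optimization as an extra residual alongside the two re-projection terms, constraining the two per-camera PnP solutions to stay mutually consistent.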
CN202310995098.8A 2022-04-18 2022-04-18 Visual positioning method, device and storage medium Pending CN117036663A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310995098.8A CN117036663A (en) 2022-04-18 2022-04-18 Visual positioning method, device and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310995098.8A CN117036663A (en) 2022-04-18 2022-04-18 Visual positioning method, device and storage medium
CN202210415435.7A CN114898084B (en) 2022-04-18 2022-04-18 Visual positioning method, device and storage medium

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202210415435.7A Division CN114898084B (en) 2022-04-18 2022-04-18 Visual positioning method, device and storage medium

Publications (1)

Publication Number Publication Date
CN117036663A true CN117036663A (en) 2023-11-10

Family

ID=82718033

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202310995098.8A Pending CN117036663A (en) 2022-04-18 2022-04-18 Visual positioning method, device and storage medium
CN202210415435.7A Active CN114898084B (en) 2022-04-18 2022-04-18 Visual positioning method, device and storage medium

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202210415435.7A Active CN114898084B (en) 2022-04-18 2022-04-18 Visual positioning method, device and storage medium

Country Status (1)

Country Link
CN (2) CN117036663A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115238A (en) * 2023-04-12 2023-11-24 荣耀终端有限公司 Pose determining method, electronic equipment and storage medium

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340887A (en) * 2020-02-26 2020-06-26 Oppo广东移动通信有限公司 Visual positioning method and device, electronic equipment and storage medium
WO2020140431A1 (en) * 2019-01-04 2020-07-09 南京人工智能高等研究院有限公司 Camera pose determination method and apparatus, electronic device and storage medium
CN111415388A (en) * 2020-03-17 2020-07-14 Oppo广东移动通信有限公司 Visual positioning method and terminal
US20200226782A1 (en) * 2018-05-18 2020-07-16 Boe Technology Group Co., Ltd. Positioning method, positioning apparatus, positioning system, storage medium, and method for constructing offline map database
WO2020155616A1 (en) * 2019-01-29 2020-08-06 浙江省北大信息技术高等研究院 Digital retina-based photographing device positioning method
CN111754579A (en) * 2019-03-28 2020-10-09 杭州海康威视数字技术股份有限公司 Method and device for determining external parameters of multi-view camera
CN111780764A (en) * 2020-06-30 2020-10-16 杭州海康机器人技术有限公司 Visual positioning method and device based on visual map
GB202016444D0 (en) * 2020-10-16 2020-12-02 Slamcore Ltd Visual-inertial localisation in an existing map
KR20210032678A (en) * 2019-09-17 2021-03-25 네이버랩스 주식회사 Method and system for estimating position and direction of image
CN112785702A (en) * 2020-12-31 2021-05-11 华南理工大学 SLAM method based on tight coupling of 2D laser radar and binocular camera
CN113382156A (en) * 2020-03-10 2021-09-10 华为技术有限公司 Pose acquisition method and device
CN113407030A (en) * 2021-06-25 2021-09-17 浙江商汤科技开发有限公司 Visual positioning method and related device, equipment and storage medium
CN113409391A (en) * 2021-06-25 2021-09-17 浙江商汤科技开发有限公司 Visual positioning method and related device, equipment and storage medium
WO2022002039A1 (en) * 2020-06-30 2022-01-06 杭州海康机器人技术有限公司 Visual positioning method and device based on visual map

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9986208B2 (en) * 2012-01-27 2018-05-29 Qualcomm Incorporated System and method for determining location of a device using opposing cameras
US9846942B2 (en) * 2013-10-09 2017-12-19 Apple Inc. Method and system for determining a pose of camera
CN107018304A (en) * 2016-01-28 2017-08-04 中兴通讯股份有限公司 A kind of image-pickup method and image collecting device
CN108734736B (en) * 2018-05-22 2021-10-26 腾讯科技(深圳)有限公司 Camera posture tracking method, device, equipment and storage medium
EP3881232A4 (en) * 2018-11-15 2022-08-10 Magic Leap, Inc. Deep neural network pose estimation system
CN110310333B (en) * 2019-06-27 2021-08-31 Oppo广东移动通信有限公司 Positioning method, electronic device and readable storage medium
CN110310326B (en) * 2019-06-28 2021-07-02 北京百度网讯科技有限公司 Visual positioning data processing method and device, terminal and computer readable storage medium
CN110599605B (en) * 2019-09-10 2021-07-13 腾讯科技(深圳)有限公司 Image processing method and device, electronic equipment and computer readable storage medium
CN110866953B (en) * 2019-10-31 2023-12-29 Oppo广东移动通信有限公司 Map construction method and device, and positioning method and device
CN114095662B (en) * 2022-01-20 2022-07-05 荣耀终端有限公司 Shooting guide method and electronic equipment

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200226782A1 (en) * 2018-05-18 2020-07-16 Boe Technology Group Co., Ltd. Positioning method, positioning apparatus, positioning system, storage medium, and method for constructing offline map database
WO2020140431A1 (en) * 2019-01-04 2020-07-09 南京人工智能高等研究院有限公司 Camera pose determination method and apparatus, electronic device and storage medium
WO2020155616A1 (en) * 2019-01-29 2020-08-06 浙江省北大信息技术高等研究院 Digital retina-based photographing device positioning method
CN111754579A (en) * 2019-03-28 2020-10-09 杭州海康威视数字技术股份有限公司 Method and device for determining external parameters of multi-view camera
KR20210032678A (en) * 2019-09-17 2021-03-25 네이버랩스 주식회사 Method and system for estimating position and direction of image
CN111340887A (en) * 2020-02-26 2020-06-26 Oppo广东移动通信有限公司 Visual positioning method and device, electronic equipment and storage medium
CN113382156A (en) * 2020-03-10 2021-09-10 华为技术有限公司 Pose acquisition method and device
CN111415388A (en) * 2020-03-17 2020-07-14 Oppo广东移动通信有限公司 Visual positioning method and terminal
CN111780764A (en) * 2020-06-30 2020-10-16 杭州海康机器人技术有限公司 Visual positioning method and device based on visual map
WO2022002039A1 (en) * 2020-06-30 2022-01-06 杭州海康机器人技术有限公司 Visual positioning method and device based on visual map
GB202016444D0 (en) * 2020-10-16 2020-12-02 Slamcore Ltd Visual-inertial localisation in an existing map
CN112785702A (en) * 2020-12-31 2021-05-11 华南理工大学 SLAM method based on tight coupling of 2D laser radar and binocular camera
CN113407030A (en) * 2021-06-25 2021-09-17 浙江商汤科技开发有限公司 Visual positioning method and related device, equipment and storage medium
CN113409391A (en) * 2021-06-25 2021-09-17 浙江商汤科技开发有限公司 Visual positioning method and related device, equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
张振杰; 李建胜; 赵漫丹; 张小东: "Camera relative pose estimation based on three-view geometric constraints", Journal of Zhejiang University (Engineering Science), no. 01, 14 November 2017 (2017-11-14) *
李卓; 刘洁瑜; 周伟: "Visual loop-closure detection and pose optimization based on geometric constraints", Electronics Optics & Control, no. 05, 2 January 2018 (2018-01-02) *
汪进; 王峰; 杨春媚; 肖飞; 邱文添; 李学易: "A ground feature point localization method based on monocular vision", Video Engineering, no. 02, 5 February 2018 (2018-02-05) *

Also Published As

Publication number Publication date
CN114898084A (en) 2022-08-12
CN114898084B (en) 2023-08-25

Similar Documents

Publication Publication Date Title
WO2021121236A1 (en) Control method, electronic device, computer-readable storage medium, and chip
US9819870B2 (en) Photographing method and electronic device
US11386699B2 (en) Image processing method, apparatus, storage medium, and electronic device
CN109313799B (en) Image processing method and apparatus
US7554575B2 (en) Fast imaging system calibration
WO2019200719A1 (en) Three-dimensional human face model-generating method and apparatus, and electronic device
US11527014B2 (en) Methods and systems for calibrating surface data capture devices
CN110296686B (en) Vision-based positioning method, device and equipment
CN111325798B (en) Camera model correction method, device, AR implementation equipment and readable storage medium
WO2021136386A1 (en) Data processing method, terminal, and server
CN108776822B (en) Target area detection method, device, terminal and storage medium
CN106713740A (en) Positioning and tracking video shooting method and system
CN110361005A (en) Positioning method, positioning device, readable storage medium and electronic equipment
CN114898084B (en) Visual positioning method, device and storage medium
CN111654624B (en) Shooting prompting method and device and electronic equipment
CN111325828B (en) Three-dimensional face acquisition method and device based on three-dimensional camera
CN113344789B (en) Image splicing method and device, electronic equipment and computer readable storage medium
WO2021238317A1 (en) Panoramic image capture method and device
CN114429495A (en) Three-dimensional scene reconstruction method and electronic equipment
CN111814811A (en) Image information extraction method, training method and device, medium and electronic equipment
CN114841863A (en) Image color correction method and device
CN110012208B (en) Photographing focusing method and device, storage medium and electronic equipment
CN107087114B (en) Shooting method and device
KR102628714B1 (en) Photography system for surpporting to picture for mobile terminal and method thereof
TWI823491B (en) Optimization method of a depth estimation model, device, electronic equipment and storage media

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination