CN110532410B - Camera positioning and neural network training method and device - Google Patents

Camera positioning and neural network training method and device

Info

Publication number
CN110532410B
CN110532410B (application number CN201910815145.XA)
Authority
CN
China
Prior art keywords
image
network
camera
sub
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910815145.XA
Other languages
Chinese (zh)
Other versions
CN110532410A (en)
Inventor
丁明宇
王哲
石建萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Sensetime Lingang Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Lingang Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Sensetime Lingang Intelligent Technology Co Ltd filed Critical Shanghai Sensetime Lingang Intelligent Technology Co Ltd
Priority to CN201910815145.XA priority Critical patent/CN110532410B/en
Publication of CN110532410A publication Critical patent/CN110532410A/en
Application granted granted Critical
Publication of CN110532410B publication Critical patent/CN110532410B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53 - Querying
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/587 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using geographical or spatial information, e.g. location
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 - Geometric image transformation in the plane of the image
    • G06T3/40 - Scaling the whole image or part thereof
    • G06T3/4038 - Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00 - Indexing scheme for image data processing or generation, in general
    • G06T2200/32 - Indexing scheme for image data processing or generation, in general involving image mosaicing

Abstract

Embodiments of the present disclosure provide a camera positioning method and apparatus and a neural network training method and apparatus. The method includes: retrieving an initial pairing image of a query image from an image database, where the camera absolute pose corresponding to each image in the image database is known; acquiring the predicted camera relative pose of the initial pairing image and the query image, and determining a camera estimation pose of the query image according to the predicted camera relative pose; retrieving a new pairing image of the query image from the image database according to the camera estimation pose of the query image; predicting the camera relative pose of the new pairing image and the query image; and determining the camera absolute pose of the query image based on the camera relative pose of the new pairing image and the query image and the camera absolute pose of the new pairing image. The present disclosure improves the accuracy of camera positioning.

Description

Camera positioning and neural network training method and device
Technical Field
The present disclosure relates to machine learning technologies, and in particular, to a camera positioning method and apparatus and a neural network training method and apparatus.
Background
As living standards improve, maps and automobiles have become indispensable to daily travel. Recent advances in computer vision bring great convenience to everyday life, and both map navigation and automobiles need to localize the images captured by a camera. Likewise, robot navigation cannot do without vision-based camera positioning, so camera positioning is an important task in the field of computer vision. It can be applied to multiple tasks such as autonomous driving, robotics and map navigation, and is therefore of great significance.
Among conventional camera positioning methods there is a relative positioning approach, in which the camera absolute pose of one image is obtained from the camera relative pose between two images and the camera absolute pose of the other image. However, the positioning accuracy of such relative positioning is currently low.
Disclosure of Invention
In view of the above, the present disclosure provides at least a camera positioning method and apparatus and a neural network training method and apparatus, so as to improve the accuracy of camera positioning.
In a first aspect, a camera positioning method is provided, where the method includes:
retrieving an initial pairing image of a query image in an image database; the absolute pose of a camera corresponding to the image in the image database is known;
acquiring the predicted camera relative pose of the initial pairing image and the query image; and determining a camera estimation pose of the query image according to the predicted camera relative pose;
according to the camera estimation pose of the query image, retrieving a new pairing image of the query image from the image database;
predicting the relative camera poses of the new pairing image and the query image;
determining the camera absolute pose of the query image based on the camera relative pose of the new pairing image and the query image and the camera absolute pose of the new pairing image.
In some embodiments, before retrieving the initial paired image of the query image in the image database, the method further comprises: acquiring a known geographic database, wherein the known geographic database comprises a plurality of images with known absolute poses of cameras; and selecting an image corresponding to a preset geographic area from the known geographic database to construct the image database.
In some embodiments, before retrieving the initial paired image of the query image in the image database, the method further comprises: collecting a plurality of images in a preset geographic area through a collection intelligent terminal provided with a collection camera; determining the absolute poses of the cameras corresponding to the acquired images respectively; and constructing an image database corresponding to the preset geographic area according to the acquired images and the absolute poses of the cameras thereof.
In some embodiments, the predetermined geographic area corresponding to each image of the image database is any one of the following types of areas: a map navigation area, an intelligent driving positioning area, or a robot navigation area.
In some embodiments, the retrieving the initial paired images of the query image in the image database includes: receiving the query image to be subjected to camera positioning; extracting image features of the query image; and retrieving the initial pairing image of the query image from the image database according to the image characteristics of the query image.
In some embodiments, the camera positioning method is performed by a camera positioning apparatus comprising a camera positioning neural network. The camera positioning neural network includes: a sharing sub-network, a coarse retrieval sub-network, a fine retrieval sub-network and a relative pose regression sub-network; the sharing sub-network is connected to the coarse retrieval sub-network, the fine retrieval sub-network and the relative pose regression sub-network, respectively; the query image, the initial pairing image and the new pairing image are each subjected to image feature extraction processing by the sharing sub-network to obtain the respective images after sharing processing. Before retrieving the initial pairing image of the query image in the image database, the method further comprises: processing the query image after sharing processing with the coarse retrieval sub-network to obtain the image features of the query image, so that the initial pairing image can be retrieved according to the image features. The acquiring of the predicted camera relative pose of the initial pairing image and the query image comprises: processing the initial pairing image and the query image after sharing processing with the fine retrieval sub-network, which outputs the predicted camera relative pose of the initial pairing image and the query image. The predicting of the camera relative pose of the new pairing image and the query image comprises: processing the new pairing image and the query image after sharing processing with the relative pose regression sub-network, which outputs the camera relative pose of the new pairing image and the query image.
In some embodiments, the relative pose regression sub-network comprises a decoding network portion and a regression network portion. Processing the new pairing image and the query image after sharing processing with the relative pose regression sub-network, which outputs the camera relative pose of the new pairing image and the query image, comprises: inputting the image pair consisting of the new pairing image and the query image after sharing processing into the relative pose regression sub-network, and obtaining the image features of each image in the image pair through the decoding network portion of the relative pose regression sub-network; splicing the image features of the new pairing image and the query image to obtain a spliced feature; and processing the spliced feature with the regression network portion of the relative pose regression sub-network to output the predicted camera relative pose of the query image and the new pairing image.
In some embodiments, the fine retrieval sub-network comprises a decoding network portion and a regression network portion. Processing the initial pairing image and the query image after sharing processing with the fine retrieval sub-network, which outputs the predicted camera relative pose of the initial pairing image and the query image, comprises: inputting the image pair consisting of the initial pairing image and the query image after sharing processing into the fine retrieval sub-network, and obtaining the image features of each image in the image pair through the decoding network portion of the fine retrieval sub-network; splicing the image features of the initial pairing image and the query image to obtain a spliced feature; and processing the spliced feature with the regression network portion of the fine retrieval sub-network to output the predicted camera relative pose of the query image and the initial pairing image.
In some embodiments, before the query image after sharing processing is processed by the coarse retrieval sub-network to obtain the image features of the query image, the method further includes: extracting image features for each image in the image database using the sharing sub-network and the coarse retrieval sub-network of a pre-trained camera positioning neural network; and labeling each image with its image features so that image retrieval can be performed according to the image features.
In some embodiments, the retrieving of the initial pairing image of the query image in the image database includes: retrieving a plurality of initial pairing images of the query image, so as to obtain a plurality of new pairing images according to the plurality of initial pairing images. The determining of the camera absolute pose of the query image comprises: obtaining a candidate camera absolute pose of the query image according to each of the new pairing images, and obtaining the camera absolute pose of the query image according to these candidate camera absolute poses.
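The present disclosure does not specify how the candidate camera absolute poses are combined; purely as an illustration, the sketch below fuses them by simple averaging (translations averaged, quaternions sign-aligned and then normalized), which is one plausible choice among several.

```python
import numpy as np

def fuse_candidate_poses(quats, trans):
    """Fuse candidate absolute poses (unit quaternions and translations)
    obtained from several new pairing images into a single pose estimate."""
    t = np.mean(trans, axis=0)
    # Align quaternion signs to the first candidate before averaging,
    # since q and -q represent the same rotation.
    ref = quats[0]
    aligned = [q if np.dot(q, ref) >= 0 else -q for q in quats]
    q = np.mean(aligned, axis=0)
    return q / np.linalg.norm(q), t
```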
In a second aspect, a training method for a camera localization neural network is provided, the method including:
acquiring a plurality of groups of image pairs, wherein each group of image pair comprises a query image and a pairing image, the pairing image and the query image each have a corresponding camera absolute pose, and each image pair also has labeling information of the camera relative pose;
predicting, by a camera localization neural network, a camera relative pose between the query image and a counterpart image for any image pair; determining a camera estimate pose of the query image based on the relative pose and a camera absolute pose of a paired image;
retrieving a new pairing image of the query image from an image database according to the camera estimation pose of the query image, wherein the new pairing image and the query image form a new image pair;
predicting, by a camera positioning neural network, a camera relative pose of the query image and the new paired image in the new image pair; and adjusting network parameters of the camera positioning neural network based on the difference between the predicted information and the labeled information of the relative pose of the camera.
In some embodiments, the camera localization neural network comprises: a sharing sub-network, a coarse retrieval sub-network, a fine retrieval sub-network and a relative pose regression sub-network; the sharing sub-network is respectively connected with the coarse retrieval sub-network, the fine retrieval sub-network and the relative pose regression sub-network;
the query image, the pairing image and the new pairing image are each subjected to image feature extraction processing by the sharing sub-network to obtain the respective images after sharing processing. After the acquiring of the plurality of sets of image pairs, the method further comprises: processing the query image and the pairing image after sharing processing with the coarse retrieval sub-network to obtain an image relationship parameter between the query image and the pairing image; processing the pairing image and the query image after sharing processing with the fine retrieval sub-network, which outputs the predicted camera relative pose; and processing the new pairing image and the query image after sharing processing with the relative pose regression sub-network, which outputs the camera relative pose of the new pairing image and the query image. The adjusting of the network parameters of the camera positioning neural network comprises: adjusting the network parameters of the sharing sub-network, the coarse retrieval sub-network, the fine retrieval sub-network and the relative pose regression sub-network according to the difference between the prediction information and the labeling information of the image relationship parameter, the difference between the prediction information and the labeling information of the camera relative pose output by the fine retrieval sub-network, and the difference between the prediction information and the labeling information of the camera relative pose output by the relative pose regression sub-network.
In some embodiments, the obtaining of the image relationship parameter between the query image and the pairing image includes: and determining the predicted relative angle offset of the camera poses of the query image and the matched image as the image relation parameter according to the rotation poses of the query image and the matched image in a group of image pairs.
In some embodiments, the obtaining of the image relationship parameter between the query image and the pairing image includes: grouping a plurality of image pairs corresponding to the same query image according to the difficulty degree of regressing relative poses; respectively obtaining the image characteristic distance of each image pair in different groups; and obtaining a predicted value of hard sample mining loss according to the image characteristic distance, wherein the hard sample mining loss is used for expressing the relation between any image characteristic distances in different groups.
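As a rough illustration of one possible form of such a hard sample mining loss (the easy/hard grouping criterion, the margin value and the function name below are assumptions, not details from this disclosure), the image pairs of the same query image could be split into an easy group and a hard group, and the loss could require the feature distance of easy pairs to be smaller than that of hard pairs by a margin:

```python
import torch
import torch.nn.functional as F

def hard_sample_mining_loss(query_feat, easy_feats, hard_feats, margin=0.5):
    """Encourage the query's feature distance to 'easy' pairing images to be
    smaller than its distance to 'hard' pairing images by at least `margin`."""
    d_easy = torch.cdist(query_feat.unsqueeze(0), easy_feats).min()
    d_hard = torch.cdist(query_feat.unsqueeze(0), hard_feats).min()
    return F.relu(d_easy - d_hard + margin)
```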
In a third aspect, there is provided a camera positioning device, the device comprising:
the initial retrieval module is used for retrieving an initial pairing image of the query image in the image database; the absolute pose of a camera corresponding to the image in the image database is known;
the initial prediction module is used for acquiring the predicted camera relative pose of the initial pairing image and the query image; and determining a camera estimation pose of the query image according to the predicted camera relative pose;
the retrieval module is used for retrieving a new matched image of the query image from the image database according to the camera estimation pose of the query image;
a re-prediction module for predicting the relative camera poses of the new pairing image and the query image;
a positioning determination module, configured to determine a camera absolute pose of the query image based on the camera relative poses of the new paired image and the query image, and the camera absolute pose of the new paired image.
In some embodiments, the apparatus further comprises: a first image acquisition module, configured to acquire a known geographic database before the initial pairing image of the query image is retrieved from the image database, wherein the known geographic database comprises a plurality of images with known camera absolute poses, and to select images corresponding to a predetermined geographic area from the known geographic database to construct the image database.
In some embodiments, the apparatus further comprises: the second image acquisition module is used for acquiring a plurality of images in a preset geographic area through an acquisition intelligent terminal provided with an acquisition camera before the initial pairing image of the query image is retrieved from the image database; determining the absolute poses of the cameras corresponding to the acquired images respectively; and constructing an image database corresponding to the preset geographic area according to the acquired images and the absolute poses of the cameras thereof.
In some embodiments, the predetermined geographic area corresponding to each image of the image database is any one of the following types of areas: a map navigation area, an intelligent driving positioning area, or a robot navigation area.
In some embodiments, the initial retrieval module is specifically configured to: receiving the query image to be subjected to camera positioning; extracting image features of the query image; and retrieving the initial pairing image of the query image from the image database according to the image characteristics of the query image.
In some embodiments, the camera localization apparatus comprises a camera localization neural network; the camera localization neural network includes: a sharing sub-network, a coarse retrieval sub-network, a fine retrieval sub-network and a relative pose regression sub-network; the sharing sub-network is respectively connected with the coarse retrieval sub-network, the fine retrieval sub-network and the relative pose regression sub-network; the sharing sub-network is used for respectively carrying out image feature extraction processing on the query image, the initial pairing image and the new pairing image to respectively obtain images after sharing processing; the initial retrieval module, before being configured to retrieve an initial pairing image of the query image, is further configured to: carrying out rough retrieval sub-network processing on the shared query image to obtain the image characteristics of the query image so as to retrieve the initial pairing image according to the image characteristics; the initial prediction module, when configured to obtain the predicted relative pose of the camera between the initial pairing image and the query image, includes: processing the initial pairing image and the query image after sharing processing through the fine retrieval sub-network, and outputting the predicted camera relative poses of the initial pairing image and the query image through the fine retrieval sub-network; the re-prediction module, when configured to predict the relative camera poses of the new pairing image and the query image, includes: and processing the newly-paired image and the query image after sharing processing through the relative pose regression sub-network, and outputting the relative poses of the cameras by the relative pose regression sub-network.
In some embodiments, the relative pose regression sub-network comprises a decoding network portion and a regression network portion; the re-prediction module is specifically configured to: inputting the image pair of the newly-paired image and the query image after sharing processing into the relative pose regression sub-network, and obtaining the image characteristics of each image in the image pair after processing of a decoding network part in the relative pose regression sub-network; splicing the image characteristics of the newly paired image and the query image to obtain spliced characteristics; and after the splicing characteristics are processed by the regression network part of the relative pose regression sub-network, outputting the predicted camera relative poses of the query image and the newly matched image.
In some embodiments, the fine search sub-network comprises a decoding network portion and a regression network portion; the initial prediction module is specifically configured to: inputting the image pair of the initial pairing image and the query image after sharing processing into the fine search sub-network, and obtaining the image characteristics of each image in the image pair after processing of a decoding network part in the fine search sub-network; splicing the image characteristics of the initial pairing image and the query image to obtain a splicing characteristic; and after the splicing characteristics are processed by a regression network part of the fine search sub-network, outputting the relative camera poses of the predicted query image and the initial pairing image.
In some embodiments, the initial retrieval module is further configured to: before the query image after sharing processing is processed by the coarse retrieval sub-network to obtain the image features of the query image, extract image features for each image in the image database using the sharing sub-network and the coarse retrieval sub-network of a pre-trained camera positioning neural network; and label each image with its image features so that image retrieval can be performed according to the image features.
In some embodiments, the retrieving the initial paired image of the query image in the image database includes: retrieving a plurality of initial pairing images of the query image to obtain a plurality of new pairing images according to the plurality of initial pairing images; the positioning determination module is specifically configured to: and respectively obtaining the absolute camera pose of the query image according to each new pairing image of the new pairing images, and obtaining the absolute camera pose of the query image according to the absolute camera poses.
In a fourth aspect, a training apparatus for a camera localization neural network is provided, the apparatus including:
an image acquisition module, configured to acquire a plurality of groups of image pairs, wherein each group of image pair comprises a query image and a pairing image, the pairing image and the query image each have a corresponding camera absolute pose, and each image pair also has labeling information of the camera relative pose;
a relative prediction module for predicting a camera relative pose between the query image and the paired image for any image pair through a camera localization neural network;
an estimated pose module to determine a camera estimated pose of the query image based on the relative pose and a camera absolute pose of a paired image;
a new image module, configured to retrieve a new paired image of the query image from an image database according to the estimated pose of the camera of the query image, where the new paired image and the query image form a new image pair;
a pose prediction module for predicting the camera relative poses of the query image and the new paired images in the new image pair through a camera positioning neural network;
and the parameter adjusting module is used for adjusting network parameters of the camera positioning neural network based on the difference between the predicted information and the labeling information of the relative pose of the camera.
In some embodiments, the camera localization neural network comprises: a sharing sub-network, a coarse retrieval sub-network, a fine retrieval sub-network and a relative pose regression sub-network; the sharing sub-network is respectively connected with the coarse retrieval sub-network, the fine retrieval sub-network and the relative pose regression sub-network; the sharing sub-network is used for carrying out image feature extraction processing on the query image, the initial pairing image and the new pairing image through the sharing sub-network to respectively obtain images after sharing processing; the device further comprises: the initial retrieval module is used for carrying out rough retrieval sub-network processing on the query image and the matched image after sharing processing to obtain an image relation parameter between the query image and the matched image; the relative prediction module is specifically configured to: processing the matched image and the query image after sharing processing through a fine retrieval sub-network, and outputting the prediction information of the relative poses of the matched image and the query image through the fine retrieval sub-network; the pose prediction module is specifically configured to: sharing the processed newly-paired image and the query image, and outputting the prediction information of the relative poses by the relative pose regression sub-network after the relative pose regression sub-network processes; the parameter adjusting module, when configured to adjust the network parameters of the camera positioning neural network, includes: and adjusting the network parameters of the sharing sub-network, the rough retrieval sub-network, the fine retrieval sub-network and the relative pose regression sub-network according to the difference between the prediction information and the labeling information of the image relation parameters, the difference between the relative pose prediction information and the labeling information output by the fine retrieval sub-network and the difference between the relative pose prediction information and the labeling information output by the relative pose regression sub-network.
In some embodiments, the initial retrieval module, when configured to obtain the image relationship parameter between the query image and the paired image, comprises: and determining the predicted relative angle offset of the camera poses of the query image and the matched image as the image relation parameter according to the rotation poses of the query image and the matched image in a group of image pairs.
In some embodiments, the initial retrieval module, when configured to obtain the image relationship parameter between the query image and the paired image, comprises: grouping a plurality of image pairs corresponding to the same query image according to the difficulty degree of regression relative pose; respectively obtaining the image characteristic distance of each image pair in different groups; and obtaining a predicted value of hard sample mining loss according to the image characteristic distance, wherein the hard sample mining loss is used for expressing the relation between any image characteristic distances in different groups.
In a fifth aspect, an electronic device is provided, which includes a memory for storing computer instructions executable on a processor, and the processor is configured to implement the camera positioning method according to any one of the embodiments of the present disclosure or implement the training method of the camera positioning neural network according to any one of the embodiments of the present disclosure when executing the computer instructions.
In a sixth aspect, a computer-readable storage medium is provided, on which a computer program is stored, the program, when being executed by a processor, implementing the method for positioning a camera according to any one of the embodiments of the present disclosure, or implementing the method for training a neural network for positioning a camera according to any one of the embodiments of the present disclosure.
According to the camera positioning and neural network training methods and apparatuses of the present disclosure, after the initial pairing image is obtained, the estimation pose of the query image is derived from the initial pairing image and retrieval is performed based on this estimation pose, so that the pose of the newly retrieved pairing image is closer to that of the query image; relative pose prediction on an image pair with smaller pose deviation is more accurate, and the camera positioning result is therefore more accurate.
Drawings
In order to more clearly illustrate one or more embodiments of the present disclosure or technical solutions in related arts, the drawings used in the description of the embodiments or related arts will be briefly described below, it is obvious that the drawings in the description below are only some embodiments described in one or more embodiments of the present disclosure, and other drawings can be obtained by those skilled in the art without inventive exercise.
Fig. 1 illustrates a flowchart of a camera positioning method according to at least one embodiment of the present disclosure;
fig. 2 illustrates a training flow of a camera localization neural network provided by at least one embodiment of the present disclosure;
fig. 3 illustrates another training procedure of a camera positioning neural network provided by at least one embodiment of the present disclosure;
FIG. 4 shows an illustration of the degree of view overlap of two images provided by at least one embodiment of the present disclosure;
fig. 5 illustrates a flowchart of a camera positioning method according to at least one embodiment of the disclosure;
fig. 6 shows a comparison schematic of the effect of obtaining a positioning result by using the camera positioning method provided by the present disclosure, provided by at least one embodiment of the present disclosure;
fig. 7 illustrates a schematic diagram of a camera positioning device provided in at least one embodiment of the present disclosure;
fig. 8 illustrates a schematic diagram of a camera positioning device provided in at least one embodiment of the present disclosure;
fig. 9 illustrates a schematic diagram of a camera positioning device provided in at least one embodiment of the present disclosure;
fig. 10 illustrates a schematic diagram of a camera positioning device according to at least one embodiment of the present disclosure.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions in one or more embodiments of the present disclosure, the technical solutions in one or more embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in one or more embodiments of the present disclosure, and it is apparent that the described embodiments are only a part of the embodiments of the present disclosure, and not all embodiments. All other embodiments that can be derived by one of ordinary skill in the art based on one or more embodiments of the disclosure without inventive faculty are intended to be within the scope of the disclosure.
Suppose there is currently an image captured by a camera, for example a camera mounted on a robot or a camera mounted on a vehicle. "Camera positioning" in the embodiments of the present disclosure means determining the camera absolute pose, that is, determining in which pose state, viewed from the world coordinate system or from the camera coordinate system itself, the camera captured the image. Once the camera is positioned, the device on which it is mounted, such as the robot or the vehicle, is positioned at the same time. This is a way of performing positioning by means of images captured by the camera.
The camera positioning method provided by the disclosure is a relative positioning mode. For example, if a camera captures images F1 and F2, wherein the two images correspond to different camera poses, the camera pose corresponding to each image can be referred to as the "absolute camera pose" of the image, and the difference or correspondence between the camera poses corresponding to the two images can be referred to as the "relative camera pose" of the two images.
Taking the images F1 and F2 above as an example, if the camera absolute pose of the image F2 and the camera relative pose between the two images are known, the camera absolute pose of the image F1 can be determined from the camera relative pose between the two images and the camera absolute pose of the image F2. This may be referred to as "relative positioning". The camera positioning method of the present disclosure determines the camera absolute pose of an image, and thus performs positioning, in this relative positioning manner. It can be seen from the above that accurately obtaining the camera relative pose between the two images is particularly important for camera positioning.
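As an illustration of this relative positioning step, the sketch below (a hypothetical helper, not an implementation taken from this disclosure) composes the known camera absolute pose of F2 with a relative pose to recover the absolute pose of F1, assuming each pose is expressed as a 3x3 rotation matrix R and a translation vector t in a world-to-camera convention.

```python
import numpy as np

def compose_absolute_pose(R2, t2, R_rel, t_rel):
    """Recover the absolute pose of image F1 from the absolute pose of F2
    and a relative pose (R_rel, t_rel) that maps camera 2 to camera 1.

    Convention (assumed here): a world point X maps to camera i as
    x_i = R_i @ X + t_i, and the relative pose satisfies
    x_1 = R_rel @ x_2 + t_rel.
    """
    R1 = R_rel @ R2
    t1 = R_rel @ t2 + t_rel
    return R1, t1
```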
Fig. 1 illustrates a flowchart of a camera positioning method according to at least one embodiment of the present disclosure, and as shown in fig. 1, the method may include the following processes:
in step 100, retrieving an initial pairing image of a query image in an image database; the absolute pose of the camera corresponding to the images in the image database is known.
In this step, the image database includes a plurality of images of a predetermined geographic area. For example, the predetermined geographic area may be an area determined according to the needs of the actual business product, such as, in the context of robotic navigation, the predetermined geographic area may be a predetermined robotic navigation area; for another example, in an autonomous driving scenario, the predetermined geographic area may be a predetermined smart driving location area; in other scenes, other areas such as a map navigation area may be used. The plurality of images in the image database are collected in the predetermined geographic area.
The image database can be constructed in various ways. Illustratively, a plurality of images can be acquired in a preset geographic area through an acquisition intelligent terminal provided with an acquisition camera, and the absolute poses of the cameras corresponding to the acquired images are determined; and constructing an image database corresponding to the preset geographic area according to the acquired images and the absolute poses of the cameras thereof. Or, the image database is constructed by selecting images corresponding to a predetermined geographic area from a known geographic database, and the absolute camera pose of each image is known.
In this step, the query image is an image to be subjected to camera positioning. The obtained image database can be used for retrieving an initial pairing image of the obtained query image, so that the initial pairing image and the query image can be used for positioning in the subsequent step according to the continuous processing of the initial pairing image and the query image.
In actual use, for a certain geographic area, an image database corresponding to that area may be established in advance, with the camera absolute pose of each image known. Then, when a camera mounted on a vehicle, a robot or another type of device captures an image in this area, the image can be used as a query image: the image database is searched based on the query image to obtain a pairing image, and the camera absolute pose of the query image is then determined. The subsequent steps describe how camera positioning is performed starting from the retrieved initial pairing image.
In step 102, acquiring the predicted camera relative pose of the initial pairing image and the query image; and determining a camera estimation pose of the query image according to the predicted camera relative pose.
In this step, the camera relative pose of the image pair consisting of the initial pairing image and the query image can be predicted; the result is referred to as the predicted camera relative pose. After the predicted camera relative pose is determined, and given that the camera absolute pose of the initial pairing image is known, the camera estimation pose of the query image can be obtained from the camera absolute pose of the initial pairing image and the predicted camera relative pose. The camera estimation pose is equivalent to an estimate of the camera absolute pose of the query image.
In step 104, retrieving a new pairing image of the query image from the image database according to the estimated pose of the camera of the query image;
in this step, the image database retrieves the image to be paired with the query image. Since the retrieval is carried out according to the estimated camera pose of the query image obtained in the step 102, the retrieved new pairing image is closer to the query image in the absolute camera pose.
In step 106, the relative camera poses of the new pairing image and the query image are predicted.
In this step, the relative pose of the camera between the newly paired image and the query image can be predicted. Because the absolute camera pose of the new paired image is closer to the absolute camera pose of the query image than the initial paired image, the predicted relative camera pose between the new paired image and the query image is more accurate.
In step 108, the absolute camera pose of the query image is determined based on the relative camera poses of the new and query images and the absolute camera pose of the new pair image.
Compared with the traditional camera positioning process, the camera positioning process provided by the embodiments of the present disclosure adds the following parts: obtaining a predicted camera relative pose from the initial pairing image and the query image, retrieving a new pairing image after obtaining the estimation pose of the query image, and predicting the camera relative pose based on the new pairing image. In the traditional approach, after the initial pairing image is retrieved, the camera relative pose of the initial pairing image and the query image is predicted directly from them, and the camera absolute pose of the query image is then calculated from this camera relative pose and the known camera absolute pose of the initial pairing image.
That is, if the process of "retrieving a pairing image for the query image" is referred to as the "retrieval phase", the camera positioning method of the present disclosure improves this retrieval phase by adopting a two-phase retrieval: in the first phase an initial pairing image is retrieved according to image features, and in the second phase a new pairing image is retrieved according to the estimated pose. Because the second retrieval is based on the estimated pose, the pose of the retrieved new pairing image is closer to that of the query image; an image pair with smaller pose deviation yields a more accurate camera relative pose prediction, and therefore a more accurate camera positioning result.
With the camera positioning method above, after the initial pairing image is obtained, the estimation pose of the query image is derived from the initial pairing image, and retrieval is then performed based on this estimation pose, so that the pose of the retrieved new pairing image is closer to that of the query image; the relative pose predicted from an image pair with smaller pose deviation is more accurate, and the camera positioning result is therefore more accurate.
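For readability, the two-phase flow of Fig. 1 can be summarized in the following sketch; the callables retrieve_by_feature, retrieve_by_pose, predict_relative_pose and compose_absolute_pose are hypothetical placeholders for the retrieval steps and sub-networks described above, not identifiers from this disclosure.

```python
def locate_camera(query_image, image_db, retrieve_by_feature, retrieve_by_pose,
                  predict_relative_pose, compose_absolute_pose):
    """Two-phase camera positioning flow of Fig. 1; the four callables stand in
    for the retrieval steps and sub-networks described above."""
    # Phase 1: feature-based retrieval of an initial pairing image (step 100).
    initial_pair = retrieve_by_feature(query_image, image_db)

    # Predict the camera relative pose and estimate the query pose (step 102).
    rel = predict_relative_pose(query_image, initial_pair.image)
    estimated_pose = compose_absolute_pose(initial_pair.absolute_pose, rel)

    # Phase 2: pose-based retrieval of a new pairing image (step 104).
    new_pair = retrieve_by_pose(estimated_pose, image_db)

    # Steps 106 and 108: relative pose against the new pairing image, then
    # compose with its known absolute pose to get the query's absolute pose.
    rel_new = predict_relative_pose(query_image, new_pair.image)
    return compose_absolute_pose(new_pair.absolute_pose, rel_new)
```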
The camera positioning method shown in fig. 1 may be implemented based on a neural network, and the type of the neural network may include, but is not limited to, a convolutional neural network, a cyclic neural network, a deep neural network, and the like. The present disclosure may provide a camera positioning neural network by which the above-described camera positioning method is performed.
The network structure of the camera positioning neural network can comprise: a structural portion for extracting image features, a structural portion for predicting the relative pose of the camera of one image pair by regression, and the like. For example, the camera localization neural network may include: a sharing sub-network, a coarse retrieval sub-network, a fine retrieval sub-network, and a relative pose regression sub-network. Each sub-network may be formed by stacking and connecting one or more network sub-units (e.g., convolutional layers, non-linear layers, pooling layers, etc.) in a manner to construct a neural network. Wherein a sharing sub-network can be connected to the coarse retrieval sub-network, the fine retrieval sub-network, and the relative pose regression sub-network, respectively, the sharing sub-network belonging to the sub-networks shared by the coarse retrieval sub-network, the fine retrieval sub-network, and the relative pose regression sub-network.
The sharing sub-network can be used to perform image feature extraction on an image, and the processed result can then be handled by the coarse retrieval sub-network, the fine retrieval sub-network, or the relative pose regression sub-network. For example, the query image after sharing processing can be passed through the coarse retrieval sub-network to obtain its image features, which are used to retrieve the initial pairing image. As another example, an image pair after sharing processing (e.g., the initial pairing image and the query image) can be passed through the fine retrieval sub-network to predict the predicted camera relative pose between the initial pairing image and the query image; similarly, an image pair after sharing processing (e.g., the new pairing image and the query image) can be passed through the relative pose regression sub-network to predict the camera relative pose of that image pair. Before the initial pairing image or the new pairing image is input into the fine retrieval sub-network or the relative pose regression sub-network, its image features are first extracted by the sharing sub-network to obtain the corresponding image after sharing processing.
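The following PyTorch-style sketch shows one possible layout of such a network, with a shared encoder feeding a coarse retrieval head and two pair-wise regression heads; the layer sizes, module names and the 7-dimensional (translation plus quaternion) pose output are illustrative assumptions rather than the architecture actually claimed here.

```python
import torch
import torch.nn as nn

class CameraLocalizationNet(nn.Module):
    """Illustrative layout: shared encoder plus coarse retrieval, fine retrieval
    and relative pose regression heads. All dimensions are assumptions."""

    def __init__(self, feat_dim=256):
        super().__init__()
        # Sharing sub-network: extracts shared features from every input image.
        self.shared_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Coarse retrieval head: produces a descriptor used for image retrieval.
        self.coarse_head = nn.Linear(feat_dim, 128)
        # Fine retrieval and relative pose regression heads: each takes the
        # concatenated features of an image pair and regresses 3 translation
        # plus 4 rotation (quaternion) parameters.
        self.fine_head = nn.Sequential(nn.Linear(2 * feat_dim, 256), nn.ReLU(), nn.Linear(256, 7))
        self.regression_head = nn.Sequential(nn.Linear(2 * feat_dim, 256), nn.ReLU(), nn.Linear(256, 7))

    def relative_pose(self, head, img_a, img_b):
        feats = torch.cat([self.shared_encoder(img_a), self.shared_encoder(img_b)], dim=1)
        return head(feats)
```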
Different sub-network parts in the camera positioning neural network can be trained and adjusted in parameters in a network training phase to obtain a better output result.
Fig. 2 will illustrate a training flow of a camera localization neural network:
training of the neural network it should be noted that in practical implementation, the training of the neural network may include training of other network parts in addition to the following processing.
In step 200, a plurality of sets of image pairs for training the neural network are acquired, each set of image pairs including a query image and a pair image.
In this step, pairs of images for training the neural network may be prepared, each pair including two images. One of the images may be referred to as the "query image" (i.e., the image to be camera-positioned) and the other as the "pairing image". The query image is fixed during the training process (also referred to as the anchor image), while the paired image may change during the training process, e.g., may subsequently change to a new paired image.
Further, the camera absolute poses of both the pairing image and the query image are known. A camera absolute pose may include three degrees of freedom for rotation R and three degrees of freedom for translation t, six degrees of freedom in total. For example, the camera absolute pose of the query image may be denoted R1, t1, and the camera absolute pose of the pairing image R2, t2.
Based on the camera absolute poses of the query image and the pairing image, the camera relative pose between the two images can be obtained; its rotation and translation components are denoted Δq and Δt, respectively. This camera relative pose of the two images serves as the relative pose labeling information, which is combined with the relative pose prediction information obtained in subsequent steps as the basis for adjusting the network parameters.
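As a hedged illustration of how such relative pose labels could be derived from the two absolute poses, the sketch below assumes a world-to-camera convention and uses SciPy's Rotation class; neither assumption comes from this disclosure.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def relative_pose_label(R1, t1, R2, t2):
    """Relative pose taking camera 1 (query image) to camera 2 (pairing image),
    assuming x_i = R_i @ X + t_i (world-to-camera convention)."""
    R_rel = R2 @ R1.T
    delta_t = t2 - R_rel @ t1
    delta_q = Rotation.from_matrix(R_rel).as_quat()  # rotation label as a quaternion
    return delta_q, delta_t
```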
In step 202, for any pair of images, the relative camera pose between the query image and the counterpart image is predicted.
For example, a regression model may take the image features of the query image and the pairing image as inputs and regress the camera relative pose between the two images. The predicted camera relative pose obtained in this step is the prediction information of the camera relative pose of the query image and the pairing image.
In step 204, a camera estimate pose of the query image is determined based on the camera relative pose and the camera absolute pose of the paired image.
In step 206, a new pairing image of the query image is retrieved according to the camera estimation pose of the query image, the new pairing image and the query image forming a new image pair.
This step retrieves, based on the camera estimation pose, another image to pair with the query image. The retrieval is pose-based: the image in the database whose camera absolute pose is closest to the estimation pose is retrieved and is referred to as the new pairing image.
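One simple way to realize this pose-based retrieval is sketched below, scoring database images by translation distance plus a weighted rotation angle; the weighting, the quaternion distance and the assumption that each database entry stores its absolute pose as attributes q and t are illustrative choices, not details from this disclosure.

```python
import numpy as np

def retrieve_by_pose(est_q, est_t, database, rot_weight=1.0):
    """Return the database entry whose absolute pose is closest to the
    estimated pose (est_q: unit quaternion, est_t: translation vector)."""
    def pose_distance(q, t):
        trans_dist = np.linalg.norm(t - est_t)
        # Angle between two unit quaternions, used as the rotation distance.
        ang_dist = 2.0 * np.arccos(np.clip(abs(np.dot(q, est_q)), -1.0, 1.0))
        return trans_dist + rot_weight * ang_dist

    return min(database, key=lambda entry: pose_distance(entry.q, entry.t))
```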
In step 208, the camera relative pose of the query image and the new pairing image in the new image pair is predicted. The camera relative pose obtained in this step is the prediction information of the camera relative pose of the query image and the new pairing image.
In step 210, network parameters are adjusted based on the difference between the prediction information and the labeling information of the camera relative pose.
In this step, loss values of the camera relative poses can be obtained separately, based on the prediction information and the labeling information of the camera relative pose of the query image and the pairing image, and on the prediction information and the labeling information of the camera relative pose of the query image and the new pairing image, and the network parameters are adjusted according to these loss values.
For example, the camera relative pose of the query image and the pairing image in step 202 may be predicted by the fine retrieval sub-network, a loss value of the camera relative pose may be obtained based on the prediction information and the labeling information of the predicted camera relative pose, and the network parameters of the fine retrieval sub-network may be adjusted accordingly. The loss function used is as follows:
    L = || Δt̂ - Δt ||_1 + || Δq̂ - Δq ||_1        (1)

where Δt̂ denotes the relative pose prediction information of the translation, Δq̂ denotes the relative pose prediction information of the rotation, Δt denotes the relative pose labeling information of the translation, and Δq denotes the relative pose labeling information of the rotation. As described above, the labeling information may be calculated in step 200 and the prediction information may be predicted in step 202. Equation (1) uses the L1 norm.
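A minimal PyTorch sketch of this L1 relative pose loss is shown below (the function name and the way the translation and rotation terms are combined are assumptions for illustration); the same form applies to equation (2) that follows.

```python
import torch

def relative_pose_l1_loss(pred_dt, pred_dq, gt_dt, gt_dq):
    """L1 loss between predicted and labeled relative translation / rotation,
    in the spirit of equation (1)."""
    return torch.abs(pred_dt - gt_dt).sum(dim=-1).mean() + \
           torch.abs(pred_dq - gt_dq).sum(dim=-1).mean()
```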
For another example, the camera relative pose of the query image and the new pairing image in step 208 may be predicted by the pose regression sub-network, a loss value of the camera relative pose may be obtained based on the prediction information and the labeling information of the camera relative pose, and the network parameters of the pose regression sub-network may be adjusted accordingly. The loss function used is as follows:
    L' = || Δt̂' - Δt' ||_1 + || Δq̂' - Δq' ||_1        (2)

where Δt̂' denotes the relative pose prediction information of the translation (relative translation), Δq̂' denotes the relative pose prediction information of the rotation (relative rotation), Δt' denotes the relative pose labeling information of the translation, and Δq' denotes the relative pose labeling information of the rotation. As described above, the labeling information may be calculated in step 200 and the prediction information may be predicted in step 208. Equation (2) has essentially the same form as equation (1); the difference is that the image pair input to the pose regression sub-network differs from the one input to the fine retrieval sub-network: the fine retrieval sub-network takes the query image and the initial pairing image, whereas the pose regression sub-network takes the query image and the new pairing image, so the relative pose symbols are primed to distinguish them from those in equation (1).
With the above training method, when the pairing image of the query image is retrieved, the retrieval is performed based on the estimation pose, so that the pose of the newly retrieved pairing image is closer to that of the query image; the camera relative pose predicted from an image pair with smaller pose deviation is more accurate, and the camera positioning result is therefore more accurate.
Further, the fine retrieval sub-network and the pose regression sub-network obtained by the training process of Fig. 2 can be combined with a coarse retrieval sub-network to perform camera positioning according to the process of Fig. 1, where the coarse retrieval sub-network may extract the image features of the query image using a network already used in existing "relative positioning" approaches. Optionally, the coarse retrieval sub-network may also be improved so that it outputs more accurate image features for retrieval, which in turn improves the retrieval accuracy of the initial pairing image.
Based on this, fig. 3 illustrates another training method of the camera positioning neural network, please refer to fig. 3, wherein the neural network includes three training branches, as follows:
an ICR (Image-based Coarse Retrieval) Module, also called a Coarse Retrieval sub-network, is used for retrieving an initial pairing Image based on Image features.
A PFR (Pose-based Fine Retrieval) module, also called the fine retrieval sub-network, is used for obtaining a new pairing image based on pose retrieval.
A PRP module (relative pose regression module), also called the pose regression sub-network, is configured to perform accurate relative pose regression on the new pairing image and the query image to obtain the camera relative pose of the new pairing image and the query image.
In the method of this embodiment, the ICR module, the PFR module and the PRP module are trained, and several loss functions are used jointly to improve the accuracy of image retrieval by the ICR module. In addition, when the neural network is trained, image pairs are used directly for training, so the ICR module does not need to retrieve pairing images during training; moreover, in the training phase the ICR module and the PFR module are trained in parallel, whereas during network application the two modules are executed in sequence, for example an initial pairing image of the query image is first retrieved according to the ICR module, and the camera relative pose between the initial pairing image and the query image is then predicted by the PFR module.
The ICR module and the PFR module form the retrieval part for retrieving pairing images, through which the pairing image matched with the query image is obtained. Based on the retrieval result of this retrieval part, the PRP module regresses the camera relative pose between the pairing image and the query image.
During this training process, the image pairs are prepared in advance for training and do not need to be retrieved by the ICR module. Each image pair includes two images, and each image is labeled with its camera absolute pose, in both rotation and translation. In addition, in this embodiment, the training of the ICR module uses a view overlap (view-angle coincidence) loss, an angle offset loss and a hard sample mining loss; therefore, view overlap labeling information and angle offset labeling information are also calculated for each image pair, while the corresponding view overlap and angle offset predictions can be obtained directly from the ICR module.
The labeling information of the view overlap degree can be calculated as follows:
the camera intrinsic parameter is K, the pixel coordinates of the two images in the pair are X1 and X2, respectively, and the depth information of the two images is D1 and D2.
    K = [ f   0   p_x
          0   f   p_y
          0   0   1   ]

where f is the focal length of the camera and (p_x, p_y) are the coordinates of the camera's principal point.
For two images in an image pair, if the pixel of the first image is projected into the three-dimensional space first and then projected into the second image, the corresponding pixel position of a certain pixel in the first image in the second image after the projection can be represented as follows:
    X̂_2 = K ( R ( D_1 K^(-1) X_1 ) + t )

where R and t denote the relative rotation and translation between the camera poses of the two images, and X̂_2 is the position in the second image to which the pixel X_1 of the first image projects.
the proportion that the pixels of the first image fall within the pixel range of the second image after projection is calculated, which can be called the visual angle coincidence degree d. When the visual angle coincidence loss training ICR module is used, visual angle coincidence exists between the two images in any image pair.
In this embodiment, for any image pair, one of the two images is referred to as the "first image" and the other as the "second image". The calculated annotation information includes the view overlap degree of the projection of the first image onto the second image, denoted d1, and the view overlap degree of the projection of the second image onto the first image, denoted d2.
Fig. 4 illustrates the view overlap of two images: the top row of fig. 4 shows four image pairs, each comprising two images, whose view overlap values are 0.76, 0.22, 0.12 and 0.42, respectively, i.e. the proportion of pixels of one image that fall within the other image after projection. This ratio is also the degree of overlap between the two viewing frustums illustrated in fig. 4.
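For illustration only, the following is a minimal sketch (not part of the original disclosure) of how the view overlap degree d described above could be computed, assuming pinhole intrinsics K, a per-pixel depth map of the first image, and the relative rotation R and translation t from the first camera to the second camera; the function name and all parameters are illustrative assumptions.

```python
import numpy as np

def view_overlap(K, depth1, R, t, width, height, stride=4):
    """Fraction of (sampled) pixels of image 1 that, after back-projection with
    their depth and re-projection into image 2, land inside image 2's pixel range."""
    fx, fy = K[0, 0], K[1, 1]
    px, py = K[0, 2], K[1, 2]
    K_inv = np.linalg.inv(K)

    # Sample a regular grid of pixels from image 1 (stride keeps the sketch fast).
    us, vs = np.meshgrid(np.arange(0, width, stride), np.arange(0, height, stride))
    us, vs = us.ravel(), vs.ravel()
    d1 = depth1[vs, us]
    valid = d1 > 0
    us, vs, d1 = us[valid], vs[valid], d1[valid]
    if us.size == 0:
        return 0.0

    # Back-project into 3D in camera-1 coordinates: P = D * K^-1 * x_homogeneous.
    pix_h = np.stack([us, vs, np.ones_like(us)], axis=0).astype(np.float64)
    P1 = d1 * (K_inv @ pix_h)

    # Transform into camera-2 coordinates and project with K.
    P2 = R @ P1 + np.asarray(t, dtype=np.float64).reshape(3, 1)
    z2 = P2[2]
    in_front = z2 > 1e-8
    z2_safe = np.where(in_front, z2, 1.0)          # avoid division by zero
    u2 = fx * P2[0] / z2_safe + px
    v2 = fy * P2[1] / z2_safe + py

    # View overlap d: proportion of projected pixels inside image 2.
    inside = in_front & (u2 >= 0) & (u2 < width) & (v2 >= 0) & (v2 < height)
    return float(inside.mean())
```

d1 and d2 as defined above would then correspond to calling this routine in both directions (first image to second image and second image to first image).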
The annotation information for the angular offset can be calculated as follows:
for any pair of images, where the two images correspond to two camera poses, the relative angular offset of the two camera poses can be calculated as follows:
For example, when the two camera rotations are represented by unit quaternions q_1 and q_2, the relative angular offset can be computed as \alpha = 2 \arccos\left( \left| \langle q_1, q_2 \rangle \right| \right); equivalent parameterizations of the rotation difference (e.g. via rotation matrices) may likewise be used.
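A minimal sketch of the relative angular offset computation, assuming the two rotations are available as unit quaternions; since the exact formula in the original is not recoverable, the quaternion form below is an assumption.

```python
import numpy as np

def relative_angle_offset(q1, q2):
    """Angle (radians) between two camera rotations given as unit quaternions
    q = (w, x, y, z); a hypothetical helper, not the patent's exact formula."""
    q1 = np.asarray(q1, dtype=np.float64)
    q2 = np.asarray(q2, dtype=np.float64)
    q1 = q1 / np.linalg.norm(q1)
    q2 = q2 / np.linalg.norm(q2)
    dot = abs(float(np.dot(q1, q2)))   # |<q1, q2>| handles the q / -q ambiguity
    dot = min(dot, 1.0)                # numerical safety before arccos
    return 2.0 * np.arccos(dot)
```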
after the image pair and the above-mentioned label information are prepared for training, the training process of the neural network is described with reference to fig. 3:
the Anchor image in FIG. 3 may be referred to as a query image during training, where the query image is an image that is fixed and unchanged during training, and the matching image is changed, for example, updated from an initial matching image to a new matching image. Target images (Coarse) are initially determined images for pairing with the query image, and may be temporarily referred to as initial pairing images. Each query image may have multiple initial paired images. In addition, the Shared encoder module of the three training branches is used for processing images to be input into the three training branches by using the same encoder module, so that the three Shared encoder modules shown in fig. 3 can be considered to have the same network parameters.
As shown in fig. 3, the query image and the corresponding initial paired images are processed by the Shared encoder module and then input to the ICR module and the PFR module. It should be noted that the image pair composed of the query image and each initial paired image is input to both the ICR module and the PFR module, that is, the inputs of the ICR module and the PFR module are the same, and are both the image pair composed of the query image and the initial paired images, and the number of the image pairs may be multiple sets that satisfy the training requirement.
The ICR module and PFR module are two training branches trained in parallel, and we describe the training of these two modules separately as follows:
For the ICR module, this embodiment trains with three loss functions: a view overlap loss function, an angle offset loss function, and a hard sample mining loss function. In the training stage the ICR module comprises two parts: a decoding network part (ICR decoder) and a regression network part (Regressor). The decoding network part performs feature extraction on the images to obtain image features usable for retrieval, although the ICR module does not perform retrieval during training. It should be noted that the ICR module includes both the ICR decoder (the decoding network part) and the Regressor (the regression network part) in the training phase, but in the application phase of the trained neural network the ICR module may include only the ICR decoder. In the training stage the regression network part Regressor directly predicts, from the image features, the image relation parameters required for calculating the loss functions: prediction information of the view overlap degree and prediction information of the relative angle offset. The loss value of the hard sample mining loss function can be calculated directly from the image features output by the ICR decoder, without passing through the Regressor; this loss value can also be referred to as an image relation parameter.
For example, after the image pair composed of the query image and the initial pairing image is processed by the decoding network part, the image features of the query image and of the initial pairing image are obtained respectively. These image features are concatenated (concat) and input to the regression network part Regressor, which outputs prediction information of the view overlap degree and prediction information of the relative angle offset. The loss values are then calculated from the prediction information and the previously obtained annotation information:
L_{frustum} = \| d_1 - \hat{d}_1 \|_2 + \| d_2 - \hat{d}_2 \|_2 ......(7)

Formula (7) computes the view overlap loss using the L2 norm, where d_1 is the annotation information of the view overlap degree of the projection of the first image onto the second image, d_2 is the annotation information of the view overlap degree of the projection of the second image onto the first image, \hat{d}_1 is the prediction information of the view overlap degree of the projection of the first image onto the second image, and \hat{d}_2 is the prediction information of the view overlap degree of the projection of the second image onto the first image.
L_{angle} = \| \alpha - \hat{\alpha} \|_2 ......(8)

Formula (8) computes the angle offset loss using the L2 norm, where \alpha is the annotation information of the relative angle offset between the two images in the image pair, and \hat{\alpha} is the prediction information of that relative angle offset.
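A minimal PyTorch sketch of the view overlap loss (7) and the angle offset loss (8), assuming the Regressor outputs the predictions \hat{d}_1, \hat{d}_2 and \hat{\alpha} as tensors; the function and argument names are illustrative.

```python
import torch

def frustum_and_angle_losses(d1, d2, d1_hat, d2_hat, alpha, alpha_hat):
    """View overlap loss (7) and angle offset loss (8) as L2 distances between
    annotations (d1, d2, alpha) and predictions (*_hat); all tensors have
    shape (batch,). Names are illustrative."""
    l_frustum = torch.norm(d1 - d1_hat, p=2) + torch.norm(d2 - d2_hat, p=2)
    l_angle = torch.norm(alpha - alpha_hat, p=2)
    return l_frustum, l_angle
```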
As above, for multiple sets of image pairs input to the ICR module, the angular offset loss and the viewing angle overlap loss of the two images can be calculated for each set of image pairs. The prediction information of the view angle overlapping degree and the prediction information of the angle offset (also referred to as relative angle offset) obtained by the prediction of the regression network part may be referred to as image relation parameters of the two images.
For the computation of the hard sample mining loss, the image pairs used for training may be grouped according to the difficulty of regressing the relative pose. For example, if the view overlap of the two images in an image pair is higher, relative pose regression is easier to perform (i.e. can be expected to be more accurate); likewise, if the difference between the absolute camera poses of the two images is smaller, relative pose regression is easier. According to this criterion, the image pairs for training can be divided into three groups, easy, moderate and hard: the image pairs in the easy group allow relative pose regression more easily, having higher view overlap and smaller absolute pose difference, while relative pose regression is progressively more difficult for the other two groups.
With continued reference to fig. 3, after the two images in each image pair are processed by the ICR decoder, the image features of each image can be obtained. On the basis of the three groups easy, moderate and hard, the image pairs of the training set may be further divided into a plurality of batches, where each batch includes N query images; then, for each query image, an initial paired image is randomly extracted from each of the easy, moderate and hard groups. For example, taking one query image P: P-P1 is one image pair, located in the easy group; P-P2 is another image pair, located in the moderate group; P-P3 is yet another image pair, located in the hard group; P1, P2 and P3 are three initial paired images corresponding to the same query image P. The image pairs P-P1, P-P2 and P-P3 belong to the same batch, and the other query images in the batch extract their initial paired images from the different groups in the same manner. This construction of a batch may be referred to as "batch hard sampling". Alternatively, the division into easy, moderate and hard groups may be performed during the initial construction of the training set, and the forming of a batch by "batch hard sampling" may also be performed after the image pairs pass through the ICR decoder.
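The batch construction described above could look like the following sketch; the grouping data structure and the sampling of one paired image per difficulty group are assumptions made for illustration.

```python
import random

def build_hard_sampling_batch(groups, batch_queries):
    """groups: dict query_id -> {"easy": [...], "moderate": [...], "hard": [...]}
    mapping each query image to its grouped candidate paired images.
    Returns, for each query image in the batch, one paired image per group."""
    batch = []
    for q in batch_queries:
        sample = {
            level: random.choice(groups[q][level])
            for level in ("easy", "moderate", "hard")
        }
        # Three image pairs per query image: (q, easy), (q, moderate), (q, hard).
        batch.append((q, sample["easy"], sample["moderate"], sample["hard"]))
    return batch
```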
Based on the above-constructed batch, a hard sample mining penalty can be calculated, which is used to represent the relationship between arbitrary image feature distances in different groups:
L_{triplet} = \sum_{i} \left[ \| f_i^{a} - f_i^{e} \|_2 - \| f_i^{a} - f_i^{m} \|_2 + \beta \right]_+ + \left[ \| f_i^{a} - f_i^{m} \|_2 - \| f_i^{a} - f_i^{h} \|_2 + \beta \right]_+ ......(9)

As shown in formula (9), L_{triplet} is the hard sample mining loss, and f_i^{a}, f_i^{e}, f_i^{m}, f_i^{h} are respectively the image features of the i-th anchor image and of its corresponding easy, moderate and hard images, i.e. the image features obtained by the ICR decoder in fig. 3. Here \beta is a margin (threshold) on the image feature distance, and (z)_+ = max(z, 0).
When adjusting the network parameters of the ICR module according to the hard sample mining loss, the goal is that the feature distance of an easy image pair should be smaller than that of the corresponding moderate image pair, and the feature distance of a moderate image pair should in turn be smaller than that of the corresponding hard image pair, each by at least the margin \beta. The method obtains the image feature distance of each image pair in the different groups (easy, moderate, hard) and computes the predicted value of the hard sample mining loss from these distances; this loss is also referred to as an image relation parameter between the query image and its paired images, since it captures the relation between the image feature distances of the image pairs. According to the meaning of formula (9), the hard sample mining loss value L_{triplet} is 0 when model training is completed; during training, the value of the hard sample mining loss is driven towards 0 by adjusting the network parameters.
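A minimal PyTorch sketch of a hard sample mining loss consistent with the description of formula (9): the anchor feature should be closer to its easy pair than to its moderate pair, and closer to its moderate pair than to its hard pair, by at least the margin β. The exact form used in the original is not recoverable, so this is an assumption.

```python
import torch
import torch.nn.functional as F

def hard_sample_mining_loss(f_anchor, f_easy, f_moderate, f_hard, beta=0.1):
    """Margin-based ordering loss over feature distances of easy/moderate/hard
    pairs (a sketch consistent with the description of formula (9))."""
    d_easy = F.pairwise_distance(f_anchor, f_easy)        # shape (batch,)
    d_moderate = F.pairwise_distance(f_anchor, f_moderate)
    d_hard = F.pairwise_distance(f_anchor, f_hard)
    # (z)_+ = max(z, 0): easy pairs must be closer than moderate ones by beta,
    # and moderate pairs closer than hard ones by beta.
    loss = F.relu(d_easy - d_moderate + beta) + F.relu(d_moderate - d_hard + beta)
    return loss.mean()
```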
As described above, in this embodiment the coarse retrieval sub-network (ICR module) uses three kinds of loss functions: a view overlap loss function, an angle offset loss function, and a hard sample mining loss function. In practical implementations the ICR module may use more or fewer loss functions and is not limited to these three. For example, only the view overlap loss may be calculated for adjusting the ICR network parameters, or the view overlap loss and the angle offset loss may be calculated together. This will not be described in further detail.
Looking next at the calculation of the loss function of the PFR module. Referring to fig. 3, the PFR module does not need to divide the image pairs into groups such as easy and hard; the PFR decoder directly processes the image pairs used for training to obtain the image features of the two images. For any image pair, after the image features of its two images are concatenated (concat) and input into the regression network part Regressor of the PFR network, prediction information of the relative camera pose between the two images is directly output. This relative pose prediction information comprises the prediction information of the translation component, denoted \hat{t}, and the prediction information of the rotation component, denoted \hat{q}. The corresponding annotation information has already been calculated in the foregoing step 200, and the relative pose loss value of the PFR module is calculated according to formula (1).
With continued reference to fig. 3, the relative pose prediction information \hat{t} and \hat{q} output by the PFR module can be used to update the Target images (Coarse), i.e. to update the paired image of the query image. For example, an estimated pose of the query image may be obtained from the relative camera pose predicted by the PFR module and the absolute camera pose of the initial pairing image in the image pair. Based on this estimated pose, the image whose pose is closest to the estimated pose may be retrieved from the database as a new paired image of the query image. It should be noted that the "database" may include many images, and the image pairs obtained before training may be formed by randomly selecting images from the database to pair with the query images, thereby constructing the image pairs used for training the ICR and PFR modules. Here, after the PFR module predicts the relative camera pose, a pose-based retrieval is performed, and some new images are retrieved from the database to pair with the query image; these new paired images may be different from the previous initial paired images.
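A minimal sketch of the pose-based retrieval step: the estimated pose of the query image is composed from the initial pairing image's absolute pose and the predicted relative pose, and the database image with the nearest pose is retrieved. The pose representation (translation plus unit quaternion), the composition convention and the translation-only nearest-pose criterion are all assumptions for illustration.

```python
import numpy as np

def quat_to_rot(q):
    """Rotation matrix from a unit quaternion (w, x, y, z)."""
    w, x, y, z = np.asarray(q, dtype=np.float64) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def estimate_query_pose(pair_t, pair_q, rel_t, rel_q):
    """Compose the paired image's absolute pose with the predicted relative pose
    (paired image -> query image) to estimate the query image's absolute pose."""
    R_pair = quat_to_rot(pair_q)
    R_rel = quat_to_rot(rel_q)
    t_est = np.asarray(pair_t, dtype=np.float64) + R_pair @ np.asarray(rel_t, dtype=np.float64)
    R_est = R_pair @ R_rel
    return t_est, R_est

def pose_based_retrieval(t_est, db_translations, k=1):
    """Return indices of the k database images whose camera positions are
    closest to the estimated query position (Euclidean distance on translation)."""
    dists = np.linalg.norm(db_translations - t_est, axis=1)
    return np.argsort(dists)[:k]
```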
The new paired images (Target images (Fine)) are obtained based on pose retrieval; compared with the initial paired images (Target images (Coarse)), pose regression against the query image is more accurate. Referring to fig. 3, the new paired image and the query image are input to the PRP module, the image features of each image in the image pair are obtained by the PRP decoder, and after the image features of the image pair are concatenated (concat), the regression part of the PRP module directly predicts the relative camera pose between the query image and the new paired image, including its translation component and its rotation component. The relative pose loss value is then calculated according to formula (2).
The calculation of the loss functions of the three network branches ICR, PFR and PRP and their respective network structures are described in detail above. The three branches share the encoder module (Shared encoder), while each branch additionally has its own network parameters. When adjusting parameters, the network parameters can be adjusted according to the view overlap loss, the angle offset loss, the hard sample mining loss and the relative pose losses, and an overall loss function is defined as follows:
L = L_{frustum} + L_{angle} + L_{triplet} + L_{PFR} + L_{PRP} ......(10)
and (3) repeating for many times and adjusting parameters in the training process until the loss values reach a preset value range or reach a preset iteration number, and finishing the training.
Camera positioning using camera positioning neural network
In the trained neural network, the ICR module includes only the ICR decoder; the regression network part shown in fig. 3 is not included. When the trained neural network is applied to camera positioning, its processing flow can be understood with reference to the schematic diagram of fig. 1.
In addition, before applying the neural network, image features may be extracted for each image in the database using Shared encoder and ICR decoder in the trained network. In the database, each image can be marked with the absolute pose of the camera and the extracted image features, and the image features are used for retrieval.
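A minimal sketch of this offline feature extraction and of feature-based retrieval, assuming the shared encoder and ICR decoder are callable modules returning one feature vector per image; cosine similarity is used here as an illustrative retrieval metric.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_feature_index(shared_encoder, icr_decoder, db_images):
    """Extract and stack one retrieval feature per database image
    (db_images: iterable of image tensors shaped (3, H, W))."""
    feats = []
    for img in db_images:
        f = icr_decoder(shared_encoder(img.unsqueeze(0)))   # assumed (1, feat_dim)
        feats.append(F.normalize(f, dim=1))
    return torch.cat(feats, dim=0)                           # (N, feat_dim)

@torch.no_grad()
def retrieve_initial_pair(shared_encoder, icr_decoder, query_img, db_feats, k=1):
    """Return indices of the k database images with the most similar features."""
    q = F.normalize(icr_decoder(shared_encoder(query_img.unsqueeze(0))), dim=1)
    sims = (db_feats @ q.t()).squeeze(1)                     # cosine similarity
    return torch.topk(sims, k).indices
```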
Fig. 5 is a flowchart of a camera positioning method provided in the present disclosure, and as shown in fig. 5, the method may include:
in step 500, a query image to be camera positioned is received.
In this step, the query image may be an image to be subjected to camera positioning.
In step 502, an initial pairing image of the query image is retrieved according to the image feature of the query image.
In this step, the query image is encoded and decoded by the encoder-decoder to obtain its image features, and the database is searched according to these image features to obtain an initial pairing image; the initial pairing image and the query image form an image pair.
In step 504, an estimated camera pose of the query image is obtained based on the predicted relative poses of the initial pair image and the query image.
In this step, the image pair formed by the initial pairing image and the query image is input into the PFR module, and the relative camera pose of the two images is obtained through prediction. The camera estimation pose of the query image is then obtained according to the predicted relative camera pose and the absolute camera pose of the initial pairing image.
In step 506, a new pairing image of the query image is retrieved according to the camera estimation pose of the query image.
In the step, a new pairing image is obtained by searching the database according to the estimated pose of the camera, and the new pairing image and the query image form a new image pair.
In step 508, the camera relative poses of the new pairing image and the query image are predicted.
In this step, the PRP module predicts the relative camera poses of the new pairing image and the query image according to the new image pair.
In step 510, a camera absolute pose of the query image is obtained based on the camera relative poses of the new paired image and the query image and the camera absolute pose of the new paired image.
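Putting steps 500 to 510 together, the inference flow could be sketched as follows; the sub-network call signatures, the database helpers and the pose composition convention are illustrative assumptions.

```python
import numpy as np

def compose(abs_pose, rel_pose):
    """Compose an absolute pose (t, R) with a relative pose (t_rel, R_rel),
    both given as translation vector plus 3x3 rotation matrix (illustrative)."""
    t, R = abs_pose
    t_rel, R_rel = rel_pose
    return np.asarray(t) + R @ np.asarray(t_rel), R @ R_rel

def locate_camera(query_img, db, nets):
    """Camera positioning flow of steps 500-510 (illustrative sketch).
    db: placeholder database object with image features, absolute camera poses
    and nearest-neighbour helpers; nets: trained sub-networks."""
    # Step 502: feature-based coarse retrieval of an initial pairing image.
    feat = nets.icr_decoder(nets.shared_encoder(query_img))
    init_idx = db.nearest_by_feature(feat)

    # Step 504: predict the relative pose and estimate the query pose.
    rel_pose = nets.pfr(query_img, db.image(init_idx))
    est_pose = compose(db.pose(init_idx), rel_pose)

    # Step 506: pose-based fine retrieval of a new pairing image.
    new_idx = db.nearest_by_pose(est_pose)

    # Steps 508-510: regress the relative pose against the new pairing image
    # and compose it with that image's absolute camera pose.
    rel_pose_fine = nets.prp(query_img, db.image(new_idx))
    return compose(db.pose(new_idx), rel_pose_fine)
```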
According to the camera positioning method, after the paired images are retrieved based on the image characteristics, the pose of the query image is estimated according to the predicted relative pose, the retrieval based on the pose is performed once, and the two images in the new image pair after updating have closer poses, so that the regression effect of the relative poses is better, and the camera positioning result of the query image is more accurate.
Fig. 6 illustrates a comparison of the positioning results obtained by the camera positioning method provided by the present disclosure. As shown in fig. 6, the green line represents the real camera trajectory and the red line the camera trajectory predicted by the model. Comparing camera positioning with different models such as PoseNet and MapNet++, it can be seen that the camera trajectory obtained by the present method (Ours) is closest to the real trajectory, i.e. the camera positioning effect of the method is better and the positioning result is more accurate.
The neural network for positioning the camera obtained by training in the disclosure can be applied to various scenes, such as map navigation, positioning in an automatic driving system or robot navigation, and the like, and in any scene, the camera can be positioned according to images shot by the camera, so that the positioning of equipment where the camera is located is realized.
Fig. 7 is a schematic diagram of a camera positioning device provided by the present disclosure, which may include: an initial retrieval module 71, an initial prediction module 72, a re-retrieval module 73, a re-prediction module 74, and a location determination module 75.
An initial retrieval module 71, configured to retrieve an initial pairing image of the query image from the image database; the absolute pose of a camera corresponding to the image in the image database is known;
an initial prediction module 72, configured to obtain predicted camera relative poses of the initial pairing image and the query image; determining a camera estimation pose of the query image according to the predicted camera relative pose;
a re-retrieval module 73, configured to retrieve, from the image database, a new pairing image of the query image according to the estimated pose of the camera of the query image;
a re-prediction module 74 for predicting the relative camera poses of the new pairing image and the query image;
a positioning determination module 75, configured to determine an absolute camera pose of the query image based on the relative camera poses of the new paired image and the query image and the absolute camera pose of the new paired image.
In one example, as shown in fig. 8, the apparatus may further include: a first image obtaining module 76, configured to obtain a known geographic database before retrieving the initial paired image of the query image from the image database, where the known geographic database includes a plurality of images with known absolute poses of the camera; and selecting an image corresponding to a preset geographic area from the known geographic database to construct the image database.
In one example, the apparatus may further include: a second image acquisition module 77, configured to acquire a plurality of images in a predetermined geographic area through an acquisition intelligent terminal provided with an acquisition camera before an initial pairing image of a query image is retrieved from an image database; determining the absolute poses of the cameras corresponding to the acquired images respectively; and constructing an image database corresponding to the preset geographic area according to the acquired multiple images and the absolute poses of the cameras thereof.
In one example, the predetermined geographic area corresponding to each image of the image database is any one of the following types of areas: a map navigation area, an intelligent driving positioning area, or a robot navigation area.
In an example, the initial retrieving module 71 is specifically configured to: receiving the query image to be subjected to camera positioning; extracting image features of the query image; and retrieving the initial pairing image of the query image from the image database according to the image characteristics of the query image.
In one example, the camera localization apparatus includes a camera localization neural network; the camera positioning neural network includes: a sharing sub-network, a coarse retrieval sub-network, a fine retrieval sub-network and a relative pose regression sub-network; the sharing sub-network is respectively connected with the coarse retrieval sub-network, the fine retrieval sub-network and the relative pose regression sub-network;
the sharing sub-network is used for respectively carrying out image feature extraction processing on the query image, the initial pairing image and the new pairing image to respectively obtain images after sharing processing;
the initial retrieving module 71, before being configured to retrieve the initial paired image of the query image, is further configured to: carrying out rough retrieval sub-network processing on the shared query image to obtain the image characteristics of the query image so as to retrieve the initial pairing image according to the image characteristics;
the initial prediction module 72, when configured to obtain the predicted relative camera pose of the initial pairing image and the query image, includes: processing the initial pairing image and the query image after sharing processing through the fine retrieval sub-network, and outputting the predicted camera relative poses of the initial pairing image and the query image through the fine retrieval sub-network;
the re-prediction module 74, when configured to predict the relative camera poses of the new pairing image and the query image, includes: and processing the newly-paired image and the query image after sharing processing through the relative pose regression sub-network, and outputting the relative poses of the cameras by the relative pose regression sub-network.
In one example, the relative pose regression sub-network includes a decoding network portion and a regression network portion; the re-prediction module 74 is specifically configured to: inputting the image pair of the newly-paired image and the query image after sharing processing into the relative pose regression sub-network, and obtaining the image characteristics of each image in the image pair after processing of a decoding network part in the relative pose regression sub-network; splicing the image characteristics of the newly paired image and the query image to obtain spliced characteristics; and after the splicing characteristics are processed by the regression network part of the relative pose regression sub-network, outputting the predicted camera relative poses of the query image and the newly matched image.
In one example, the fine search sub-network includes a decoding network portion and a regression network portion; the initial prediction module 72 is specifically configured to: inputting the image pair of the initial pairing image and the query image after sharing processing into the fine search sub-network, and obtaining the image characteristics of each image in the image pair after processing of a decoding network part in the fine search sub-network; splicing the image characteristics of the initial pairing image and the query image to obtain a splicing characteristic; and after the splicing characteristics are processed by a regression network part of the fine search sub-network, outputting the relative camera poses of the predicted query image and the initial pairing image.
In one example, the initial retrieving module 71 is further configured to: before the query image after sharing processing is processed by a rough retrieval sub-network to obtain the image characteristics of the query image, using a pre-trained camera to position a sharing sub-network and a rough retrieval sub-network in a neural network, and extracting the image characteristics of each image in the image database; and labeling the image characteristics of each image so as to perform image retrieval according to the image characteristics.
In one example, the retrieving the initial pairing image of the query image in the image database includes: retrieving a plurality of initial pairing images of the query image to obtain a plurality of new pairing images according to the plurality of initial pairing images;
the positioning determining module 75 is specifically configured to: respectively obtain a camera absolute pose of the query image according to each of the plurality of new pairing images, and obtain the camera absolute pose of the query image according to the obtained camera absolute poses.
Fig. 9 provides a training apparatus for a camera localization neural network, which may include, as shown in fig. 9: an image acquisition module 91, a relative prediction module 92, an estimated pose module 93, a new image module 94, a pose prediction module 95, and a parameter adjustment module 96.
The image obtaining module 91 is configured to obtain a plurality of sets of image pairs, where each set of image pair includes an inquiry image and a paired image, the paired image and the inquiry image respectively have corresponding absolute poses of the cameras, and the image pair further has labeling information of a relative pose;
a relative prediction module 92 for predicting a camera relative pose between the query image and the paired image for any image pair through a camera localization neural network;
an estimated pose module 93 for determining a camera estimated pose of the query image based on the relative pose and a camera absolute pose of a paired image;
a new image module 94, configured to retrieve, from an image database, a new paired image of the query image according to the estimated pose of the camera of the query image, where the new paired image and the query image form a new image pair;
a pose prediction module 95 for predicting the camera relative poses of the query image and the new paired images in the new image pair through a camera positioning neural network;
a parameter adjusting module 96, configured to adjust network parameters of the camera positioning neural network based on a difference between the predicted information of the relative pose of the camera and the annotation information.
In one example, as shown in fig. 10, a camera localization neural network includes: a sharing sub-network, a coarse retrieval sub-network, a fine retrieval sub-network and a relative pose regression sub-network; the sharing sub-network is respectively connected with the coarse retrieval sub-network, the fine retrieval sub-network and the relative pose regression sub-network; the sharing sub-network is used for carrying out image feature extraction processing on the query image, the initial pairing image and the new pairing image through the sharing sub-network to respectively obtain images after sharing processing; the device further comprises: the initial retrieval module 97 is configured to perform rough retrieval on the query image and the paired image after sharing processing to obtain an image relationship parameter between the query image and the paired image; the relative prediction module 92 is specifically configured to: processing the matched image and the query image after sharing processing through a fine retrieval sub-network, and outputting relative pose prediction information of the matched image and the query image through the fine retrieval sub-network; the pose prediction module 95 is specifically configured to: sharing the processed newly-paired image and the query image, and outputting relative pose prediction information of the newly-paired image and the query image by the relative pose regression sub-network after the relative pose regression sub-network is processed; the parameter adjusting module 96, when configured to adjust the network parameters of the camera positioning neural network, includes: and adjusting the network parameters of the sharing sub-network, the rough retrieval sub-network, the fine retrieval sub-network and the relative pose regression sub-network according to the difference between the prediction information and the labeling information of the image relation parameters, the difference between the relative pose prediction information and the labeling information output by the fine retrieval sub-network and the difference between the relative pose prediction information and the labeling information output by the relative pose regression sub-network.
In one example, the initial retrieving module 97, when configured to obtain the image relationship parameter between the query image and the paired image, includes: and determining the predicted relative angle offset of the camera poses of the query image and the matched image as the image relation parameter according to the rotation poses of the query image and the matched image in a group of image pairs.
In one example, the initial retrieving module 97, when configured to obtain the image relationship parameter between the query image and the paired image, includes: grouping a plurality of image pairs corresponding to the same query image according to the difficulty degree of regressing relative poses; respectively obtaining the image characteristic distance of each image pair in different groups; and obtaining a predicted value of hard sample mining loss according to the image characteristic distance, wherein the hard sample mining loss is used for expressing the relation between any image characteristic distances in different groups.
The embodiment of the present disclosure further provides an electronic device, where the device includes a memory and a processor, where the memory is used to store computer instructions executable on the processor, and the processor is used to implement the camera positioning method according to any one of the embodiments of the present disclosure or implement the training method for the camera positioning neural network according to any one of the embodiments of the present disclosure when executing the computer instructions.
One skilled in the art will appreciate that one or more embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The embodiments of the present disclosure also provide a computer-readable storage medium, where a computer program may be stored on the storage medium, and when the program is executed by a processor, the method for positioning a camera according to any of the embodiments of the present disclosure is implemented, or a method for training a neural network for positioning a camera according to any of the embodiments of the present disclosure is implemented.
The embodiments in the disclosure are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the data processing apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to part of the description of the method embodiment.
The foregoing description of specific embodiments of the present disclosure has been described. Other embodiments are within the scope of the following claims. In some cases, the acts or steps recited in the claims can be performed in an order different than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Embodiments of the subject matter and functional operations described in this disclosure may be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this disclosure and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described in this disclosure can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this disclosure can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general and/or special purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not necessarily have such a device. Further, the computer may be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., an internal hard disk or a removable disk), magneto-optical disks, and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this disclosure contains many specific implementation details, these should not be construed as limitations on the scope of any disclosure or of what may be claimed, but rather as descriptions of features specific to particular embodiments of the disclosure. Certain features that are described in this disclosure in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only for the purpose of illustrating the preferred embodiments of the present disclosure, and is not intended to limit the scope of the present disclosure, which is to be construed as being limited by the appended claims.

Claims (30)

1. A camera positioning method, the method comprising:
retrieving an initial pairing image of a query image in an image database; the absolute pose of a camera corresponding to the image in the image database is known;
acquiring the predicted camera relative poses of the initial pairing image and the query image; determining a camera estimation pose of the query image according to the predicted camera relative poses;
according to the camera estimation pose of the query image, retrieving a new pairing image of the query image from the image database;
predicting the relative camera poses of the new pairing image and the query image;
determining a camera absolute pose of the query image based on the camera relative poses of the new and query images and the camera absolute pose of the new pairing image.
2. The method of claim 1, wherein prior to retrieving the initial paired image of the query image in the image database, the method further comprises:
acquiring a known geographic database, wherein the known geographic database comprises a plurality of images with known absolute poses of cameras;
and selecting an image corresponding to a preset geographic area from the known geographic database to construct the image database.
3. The method of claim 1, wherein prior to retrieving the initial paired image of the query image in the image database, the method further comprises:
collecting a plurality of images in a preset geographic area through a collection intelligent terminal provided with a collection camera;
determining the absolute poses of the cameras corresponding to the acquired images respectively;
and constructing an image database corresponding to the preset geographic area according to the acquired images and the absolute poses of the cameras thereof.
4. The method according to claim 2 or 3, wherein the predetermined geographical area corresponding to each image of the image database is any one of the following types of area:
a map navigation area, an intelligent driving positioning area, or a robot navigation area.
5. The method of claim 1, wherein retrieving the initial paired image of the query image in the image database comprises:
receiving the query image to be subjected to camera positioning;
extracting image features of the query image;
and retrieving the initial pairing image of the query image from the image database according to the image characteristics of the query image.
6. The method of claim 1, wherein the camera localization method is performed by a camera localization device comprising a camera localization neural network;
the camera localization neural network includes: a sharing sub-network, a coarse retrieval sub-network, a fine retrieval sub-network and a relative pose regression sub-network; the sharing sub-network is respectively connected with the coarse retrieval sub-network, the fine retrieval sub-network and the relative pose regression sub-network;
the query image, the initial pairing image and the new pairing image are subjected to image feature extraction processing through the sharing sub-network to respectively obtain images after sharing processing;
before retrieving the initial pairing image of the query image in the image database, the method further includes: sharing the processed query image, and performing rough retrieval sub-network processing on the query image to obtain the image characteristics of the query image so as to retrieve the initial pairing image according to the image characteristics;
the acquiring of the predicted camera relative pose of the initial pairing image and the query image comprises: the initial pairing image and the query image after sharing processing are processed by the fine retrieval sub-network, and the fine retrieval sub-network outputs the predicted camera relative poses of the initial pairing image and the query image;
the predicting the camera relative poses of the new pairing image and the query image comprises: and sharing the processed new pairing image and the query image, and processing the new pairing image and the query image through the relative pose regression sub-network, and outputting the relative poses of the cameras by the relative pose regression sub-network.
7. The method of claim 6, wherein the relative pose regression sub-network comprises a decoding network portion and a regression network portion;
the sharing processed new pairing image and the query image are processed by the relative pose regression sub-network, and the relative pose regression sub-network outputs the relative poses of the cameras of the new pairing image and the query image, and the sharing processing comprises the following steps:
the image pair of the newly paired image and the query image after sharing processing is input into the relative pose regression sub-network, and the image characteristics of each image in the image pair are obtained after the processing of the decoding network part in the relative pose regression sub-network;
splicing the image characteristics of the newly paired image and the query image to obtain spliced characteristics;
and after the splicing characteristics are processed by the regression network part of the relative pose regression sub-network, outputting the predicted camera relative poses of the query image and the newly matched image.
8. The method of claim 6, wherein the fine search sub-network comprises a decoding network portion and a regression network portion;
the shared initial pairing image and the shared inquiry image are processed by the fine search sub-network, and the predicted relative poses of the initial pairing image and the shared inquiry image are output by the fine search sub-network, wherein the processing comprises the following steps:
the image pair of the initial pairing image and the query image after sharing processing is input into the fine search sub-network, and the image characteristics of each image in the image pair are obtained after the processing of a decoding network part in the fine search sub-network;
splicing the image characteristics of the initial pairing image and the query image to obtain a splicing characteristic;
and after the splicing characteristics are processed by the regression network part of the fine search sub-network, outputting the predicted camera relative poses of the query image and the initial pairing image.
9. The method of claim 6, wherein before the query image after the sharing process is processed by a coarse search sub-network to obtain the image feature of the query image, the method further comprises:
using a pre-trained camera to locate a sharing sub-network and a coarse retrieval sub-network in a neural network, extracting image features for each image in the image database;
and labeling the image characteristics of each image so as to perform image retrieval according to the image characteristics.
10. The method of claim 1, wherein retrieving the initial paired image of the query image from the image database comprises: retrieving a plurality of initial pairing images of the query image to obtain a plurality of new pairing images according to the plurality of initial pairing images;
the determining an absolute pose of a camera of the query image includes: respectively obtaining a camera absolute pose of the query image according to each of the plurality of new pairing images, and obtaining the camera absolute pose of the query image according to the obtained camera absolute poses.
11. A method for training a neural network for camera localization, the method comprising:
acquiring a plurality of groups of image pairs, wherein each group of image pair comprises an inquiry image and a matched image, the matched image and the inquiry image respectively have corresponding absolute poses of a camera, and the image pairs also have labeling information of relative poses;
predicting, by a camera localization neural network, a camera relative pose between the query image and a counterpart image for any image pair;
determining a camera estimate pose of the query image based on the relative pose and a camera absolute pose of a paired image;
retrieving a new pairing image of the query image from an image database according to the camera estimation pose of the query image, wherein the new pairing image and the query image form a new image pair;
predicting, by a camera positioning neural network, a camera relative pose of the query image and the new paired image in the new image pair;
and adjusting network parameters of the camera positioning neural network based on the difference between the predicted information and the labeled information of the relative pose of the camera.
12. The method of claim 11,
the camera localization neural network includes: a sharing sub-network, a coarse retrieval sub-network, a fine retrieval sub-network and a relative pose regression sub-network; the sharing sub-network is respectively connected with the coarse retrieval sub-network, the fine retrieval sub-network and the relative pose regression sub-network;
the query image, the matching image and the new matching image are subjected to image feature extraction processing through the sharing sub-network to respectively obtain images after sharing processing;
after the acquiring of the plurality of sets of image pairs, the method further comprises: the shared query image and the shared matching image are processed through a coarse retrieval sub-network to obtain an image relation parameter between the query image and the matching image;
the paired image and the query image after sharing processing are processed by the fine retrieval sub-network, and the fine retrieval sub-network outputs the relative poses of the cameras;
sharing the processed new pairing image and the query image, and outputting the relative poses of the new pairing image and the query image by the relative pose regression sub-network through the relative pose regression sub-network;
the adjusting network parameters of the camera positioning neural network comprises: and adjusting the network parameters of the sharing sub-network, the rough retrieval sub-network, the fine retrieval sub-network and the relative pose regression sub-network according to the difference between the predicted information and the labeling information of the image relation parameters, the difference between the predicted camera relative pose and the labeling information output by the fine retrieval sub-network and the difference between the predicted camera relative pose and the labeling information output by the relative pose regression sub-network.
13. The method according to claim 12, wherein the obtaining of the image relationship parameter between the query image and the pairing image comprises:
and determining the predicted relative angle offset of the camera poses of the query image and the matched image as the image relation parameter according to the rotation poses of the query image and the matched image in a group of image pairs.
14. The method according to claim 12, wherein the obtaining of the image relationship parameter between the query image and the pairing image comprises:
grouping a plurality of image pairs corresponding to the same query image according to the difficulty degree of regression relative pose;
respectively obtaining the image characteristic distance of each image pair in different groups;
and obtaining a predicted value of hard sample mining loss according to the image characteristic distance, wherein the hard sample mining loss is used for expressing the relation between any image characteristic distances in different groups.
15. A camera positioning device, the device comprising:
the initial retrieval module is used for retrieving an initial pairing image of the query image in the image database; the absolute pose of a camera corresponding to the image in the image database is known;
the initial prediction module is used for acquiring the relative poses of the initial pairing image and the prediction camera of the query image; determining a camera estimation pose of the query image according to the predicted camera relative pose;
the retrieval module is used for retrieving a new matched image of the query image from the image database according to the camera estimation pose of the query image;
the re-prediction module is used for predicting the relative camera poses of the new pairing image and the query image;
a positioning determination module, configured to determine a camera absolute pose of the query image based on the camera relative poses of the new paired image and the query image, and the camera absolute pose of the new paired image.
16. The apparatus of claim 15, further comprising:
the system comprises a first image acquisition module, a second image acquisition module and a query image generation module, wherein the first image acquisition module is used for acquiring a known geographic database before an initial pairing image of a query image is retrieved from the image database, and the known geographic database comprises a plurality of images with known absolute poses of a camera; and selecting an image corresponding to a preset geographic area from the known geographic database to construct the image database.
17. The apparatus of claim 15, further comprising:
the second image acquisition module is used for acquiring a plurality of images in a preset geographic area through an acquisition intelligent terminal provided with an acquisition camera before the initial pairing image of the query image is retrieved from the image database; determining the absolute poses of the cameras corresponding to the acquired images respectively; and constructing an image database corresponding to the preset geographic area according to the acquired images and the absolute poses of the cameras thereof.
18. The apparatus according to claim 16 or 17, wherein the predetermined geographical area corresponding to each image of the image database is any one of the following types of areas: a map navigation area, an intelligent driving positioning area, or a robot navigation area.
19. The apparatus of claim 15,
the initial retrieval module is specifically configured to: receiving the query image to be subjected to camera positioning; extracting image features of the query image; and retrieving the initial pairing image of the query image from the image database according to the image characteristics of the query image.
20. The apparatus of claim 15, wherein the camera localization apparatus comprises a camera localization neural network; the camera localization neural network includes: a sharing sub-network, a coarse retrieval sub-network, a fine retrieval sub-network and a relative pose regression sub-network; the sharing sub-network is respectively connected with the coarse retrieval sub-network, the fine retrieval sub-network and the relative pose regression sub-network;
the sharing sub-network is used for respectively carrying out image feature extraction processing on the query image, the initial pairing image and the new pairing image to respectively obtain images after sharing processing;
the initial retrieval module, before being configured to retrieve an initial pairing image of the query image, is further configured to: processing the shared query image through a coarse retrieval sub-network to obtain the image characteristics of the query image, and retrieving the initial pairing image according to the image characteristics;
the initial prediction module, when configured to obtain the predicted relative pose of the camera between the initial pairing image and the query image, includes: processing the initial pairing image and the query image after sharing processing through the fine retrieval sub-network, and outputting the predicted camera relative poses of the initial pairing image and the query image through the fine retrieval sub-network;
the re-prediction module, when configured to predict the relative camera poses of the new pairing image and the query image, includes: and processing the newly-paired image and the query image after sharing processing through the relative pose regression sub-network, and outputting the relative poses of the cameras by the relative pose regression sub-network.
21. The apparatus of claim 20, wherein the relative pose regression sub-network comprises a decoding network portion and a regression network portion;
the re-prediction module is specifically configured to: input the image pair formed by the new pairing image and the query image after sharing processing into the relative pose regression sub-network, and obtain the image features of each image in the image pair after processing by the decoding network portion of the relative pose regression sub-network; concatenate the image features of the new pairing image and the query image to obtain a concatenated feature; and, after the concatenated feature is processed by the regression network portion of the relative pose regression sub-network, output the predicted camera relative pose of the query image and the new pairing image.
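For illustration only and not as part of the claims, the decode-concatenate-regress structure described above could be sketched as follows; the feature sizes and the 7-dimensional pose parameterization (3-D translation plus a rotation quaternion) are assumptions. In this sketch, the fine retrieval sub-network of claim 22 would share the same shape.

    # Illustrative sketch only: a pairwise sub-network with a decoding portion and a
    # regression portion; sizes and the pose parameterization are assumptions.
    import torch
    import torch.nn as nn

    class RelativePoseSubNetwork(nn.Module):
        def __init__(self, in_channels=128, feat_dim=256):
            super().__init__()
            # decoding network portion: one feature vector per image of the pair
            self.decode = nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(in_channels, feat_dim), nn.ReLU(),
            )
            # regression network portion: concatenated features -> relative pose
            self.regress = nn.Linear(2 * feat_dim, 7)   # 3-D translation + 4-D rotation quaternion

        def forward(self, shared_query, shared_pair):   # shared feature maps of the image pair
            f_query = self.decode(shared_query)         # image features of the query image
            f_pair = self.decode(shared_pair)           # image features of the pairing image
            concatenated = torch.cat([f_query, f_pair], dim=1)
            return self.regress(concatenated)           # predicted camera relative pose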
22. The apparatus of claim 20, wherein the fine retrieval sub-network comprises a decoding network portion and a regression network portion;
the initial prediction module is specifically configured to: input the image pair formed by the initial pairing image and the query image after sharing processing into the fine retrieval sub-network, and obtain the image features of each image in the image pair after processing by the decoding network portion of the fine retrieval sub-network; concatenate the image features of the initial pairing image and the query image to obtain a concatenated feature; and, after the concatenated feature is processed by the regression network portion of the fine retrieval sub-network, output the predicted camera relative pose of the query image and the initial pairing image.
23. The apparatus of claim 20,
the initial retrieval module is further configured to: before the query image after sharing processing is processed by the coarse retrieval sub-network to obtain the image features of the query image, extract the image features of each image in the image database by using the sharing sub-network and the coarse retrieval sub-network of a pre-trained camera positioning neural network; and label each image with its image features, so that image retrieval can be performed according to the image features.
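For illustration only and not as part of the claims, a minimal sketch of pre-extracting and labeling the image features of the database images and then retrieving the initial pairing image of a query by nearest-neighbour search; the function names are assumptions, and the descriptors are assumed to be L2-normalised so that the dot product acts as a cosine similarity.

    # Illustrative sketch only: offline feature indexing and online retrieval.
    import torch

    @torch.no_grad()
    def build_feature_index(database_images, sharing_net, coarse_net):
        """database_images: iterable of (3, H, W) tensors. Returns an (N, D) descriptor matrix."""
        feats = [coarse_net(sharing_net(img.unsqueeze(0))) for img in database_images]
        return torch.cat(feats, dim=0)

    @torch.no_grad()
    def retrieve_initial_pairing(query_image, feature_index, sharing_net, coarse_net, k=1):
        q = coarse_net(sharing_net(query_image.unsqueeze(0)))     # (1, D) query descriptor
        similarity = (feature_index @ q.t()).squeeze(1)           # similarity to every database image
        return torch.topk(similarity, k).indices                  # indices of the top-k candidate images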
24. The apparatus of claim 15, wherein retrieving the initial pairing image of the query image from the image database comprises: retrieving a plurality of initial pairing images of the query image, so that a plurality of new pairing images are obtained from the plurality of initial pairing images;
the positioning determination module is specifically configured to: obtain a candidate camera absolute pose of the query image from each of the plurality of new pairing images, and obtain the camera absolute pose of the query image from the plurality of candidate camera absolute poses.
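For illustration only and not as part of the claims, one simple way to combine the candidate camera absolute poses obtained from the plurality of new pairing images is to average them, as in the sketch below; the fusion rule (mean translation, sign-aligned mean quaternion) is an assumption and is only a reasonable approximation when the candidates are close to one another.

    # Illustrative sketch only: fusing candidate absolute poses of the query image.
    import torch

    def fuse_candidate_poses(translations, quaternions):
        """translations: (K, 3); quaternions: (K, 4), unit-norm, one candidate per new pairing image."""
        t = translations.mean(dim=0)
        reference = quaternions[0]
        # resolve the q / -q ambiguity before averaging
        signs = torch.where(quaternions @ reference < 0, torch.tensor(-1.0), torch.tensor(1.0)).unsqueeze(1)
        q = (quaternions * signs).mean(dim=0)
        return t, q / q.norm()                     # fused camera absolute pose of the query image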
25. An apparatus for training a neural network for camera localization, the apparatus comprising:
an image acquisition module for acquiring a plurality of image pairs, each image pair comprising a query image and a paired image, wherein the paired image and the query image each have a corresponding camera absolute pose, and each image pair further has labeling information of the camera relative pose;
a relative prediction module for predicting a camera relative pose between the query image and the paired image for any image pair through a camera localization neural network;
an estimated pose module for determining a camera estimated pose of the query image based on the predicted relative pose and the camera absolute pose of the paired image;
a new image module, configured to retrieve a new pairing image of the query image from an image database according to the estimated pose of the camera of the query image, where the new pairing image and the query image form a new image pair;
a pose prediction module for predicting a camera relative pose of the query image and the new pairing image in the new image pair through a camera positioning neural network;
and a parameter adjusting module for adjusting the network parameters of the camera positioning neural network based on the difference between the prediction information and the labeling information of the camera relative pose.
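For illustration only and not as part of the claims, a single training iteration following the modules of claim 25 could be sketched as below; compose_pose, retrieve_by_pose and relative_pose_between are hypothetical helper functions (not defined by this document), and the loss composition is an assumption.

    # Illustrative sketch only: one training iteration of the camera positioning neural network.
    def training_step(net, optimizer, pose_loss, query, paired, database):
        # relative prediction module: camera relative pose of the annotated image pair
        rel_pred = net(query.image, paired.image)
        # estimated pose module: camera estimated pose of the query from the paired image's absolute pose
        estimated_pose = compose_pose(paired.absolute_pose, rel_pred)          # hypothetical helper
        # new image module: retrieve a new pairing image whose pose is near the estimate
        new_paired = retrieve_by_pose(database, estimated_pose)                # hypothetical helper
        # pose prediction module: camera relative pose of the new image pair
        rel_pred_new = net(query.image, new_paired.image)
        # parameter adjusting module: supervise both predictions with relative-pose labels
        loss = (pose_loss(rel_pred, relative_pose_between(query.absolute_pose, paired.absolute_pose))
                + pose_loss(rel_pred_new, relative_pose_between(query.absolute_pose, new_paired.absolute_pose)))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()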
26. The apparatus of claim 25,
the camera localization neural network comprises a sharing sub-network, a coarse retrieval sub-network, a fine retrieval sub-network and a relative pose regression sub-network; and the sharing sub-network is connected to each of the coarse retrieval sub-network, the fine retrieval sub-network and the relative pose regression sub-network;
the sharing sub-network is configured to perform image feature extraction on the query image, the paired image and the new pairing image, respectively, to obtain the corresponding images after sharing processing;
the apparatus further comprises an initial retrieval module, configured to process the query image and the paired image after sharing processing through the coarse retrieval sub-network to obtain an image relationship parameter between the query image and the paired image;
the relative prediction module is specifically configured to: process the paired image and the query image after sharing processing through the fine retrieval sub-network, and output relative pose prediction information of the paired image and the query image via the fine retrieval sub-network;
the pose prediction module is specifically configured to: process the new pairing image and the query image after sharing processing through the relative pose regression sub-network, and output relative pose prediction information of the new pairing image and the query image via the relative pose regression sub-network;
the parameter adjusting module, when adjusting the network parameters of the camera positioning neural network, is configured to: adjust the network parameters of the sharing sub-network, the coarse retrieval sub-network, the fine retrieval sub-network and the relative pose regression sub-network according to the difference between the prediction information and the labeling information of the image relationship parameter, the difference between the relative pose prediction information output by the fine retrieval sub-network and its labeling information, and the difference between the relative pose prediction information output by the relative pose regression sub-network and its labeling information.
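For illustration only and not as part of the claims, the joint objective described above could be written as a weighted sum with one term per sub-network output; the weights and the individual loss functions are assumptions supplied by the caller.

    # Illustrative sketch only: joint objective with one term per sub-network output.
    def total_loss(relation_pred, relation_label,
                   fine_pose_pred, fine_pose_label,
                   regress_pose_pred, regress_pose_label,
                   relation_loss, pose_loss,
                   w_relation=1.0, w_fine=1.0, w_regress=1.0):
        return (w_relation * relation_loss(relation_pred, relation_label)        # coarse retrieval sub-network term
                + w_fine * pose_loss(fine_pose_pred, fine_pose_label)            # fine retrieval sub-network term
                + w_regress * pose_loss(regress_pose_pred, regress_pose_label))  # relative pose regression term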
27. The apparatus of claim 26, wherein the initial retrieval module, when obtaining the image relationship parameter between the query image and the paired image, is configured to: determine, according to the rotation poses of the query image and the paired image in an image pair, the predicted relative angular offset between the camera poses of the query image and the paired image as the image relationship parameter.
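For illustration only and not as part of the claims, a minimal sketch of the relative angular offset between two camera rotation poses represented as unit quaternions:

    # Illustrative sketch only: angular offset between two camera rotations.
    import torch

    def relative_angle_deg(q_query, q_paired):
        """q_query, q_paired: unit quaternions of shape (4,). Returns the angle between them in degrees."""
        cos_half = torch.abs(torch.dot(q_query, q_paired)).clamp(max=1.0)   # |cos(theta / 2)|, sign-invariant
        return torch.rad2deg(2.0 * torch.acos(cos_half))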
28. The apparatus of claim 26,
the initial retrieval module, when obtaining the image relationship parameter between the query image and the paired image, is configured to: group a plurality of image pairs corresponding to the same query image according to the degree of difficulty of regressing the relative pose; obtain the image feature distance of each image pair in the different groups; and obtain a predicted value of a hard sample mining loss according to the image feature distances, wherein the hard sample mining loss expresses the relationship between image feature distances in different groups.
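For illustration only and not as part of the claims, one plausible margin form of such a hard sample mining loss is sketched below, relating the image feature distances of an easier group and a harder group of image pairs for the same query; the exact form of the loss is an assumption.

    # Illustrative sketch only: margin-based hard sample mining over easy/hard pair groups.
    import torch

    def hard_sample_mining_loss(query_feat, easy_pair_feats, hard_pair_feats, margin=0.5):
        """query_feat: (D,); easy_pair_feats: (Ne, D); hard_pair_feats: (Nh, D)."""
        d_easy = torch.cdist(query_feat.unsqueeze(0), easy_pair_feats).squeeze(0)   # distances to easy pairs
        d_hard = torch.cdist(query_feat.unsqueeze(0), hard_pair_feats).squeeze(0)   # distances to hard pairs
        # penalise any easy-pair distance that is not smaller than every hard-pair distance by the margin
        return torch.relu(d_easy.max() - d_hard.min() + margin)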
29. An electronic device, comprising a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to, when executing the computer instructions, implement the method of any one of claims 1 to 10 or the method of any one of claims 11 to 14.
30. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of any one of claims 1 to 10 or the method of any one of claims 11 to 14.
CN201910815145.XA 2019-08-30 2019-08-30 Camera positioning and neural network training method and device Active CN110532410B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910815145.XA CN110532410B (en) 2019-08-30 2019-08-30 Camera positioning and neural network training method and device

Publications (2)

Publication Number Publication Date
CN110532410A (en) 2019-12-03
CN110532410B (en) 2022-06-21

Family

ID=68665524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910815145.XA Active CN110532410B (en) 2019-08-30 2019-08-30 Camera positioning and neural network training method and device

Country Status (1)

Country Link
CN (1) CN110532410B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111105590A (en) * 2019-12-27 2020-05-05 深圳前海微众银行股份有限公司 Alarm method and device
CN111612842B (en) * 2020-05-29 2023-08-18 如你所视(北京)科技有限公司 Method and device for generating pose estimation model
CN113744301A (en) * 2021-08-05 2021-12-03 深圳供电局有限公司 Motion trajectory estimation method and device for mobile robot and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106780608B (en) * 2016-11-23 2020-06-02 北京地平线机器人技术研发有限公司 Pose information estimation method and device and movable equipment
CN110097045A (en) * 2018-01-31 2019-08-06 株式会社理光 A kind of localization method, positioning device and readable storage medium storing program for executing
CN109584299B (en) * 2018-11-13 2021-01-05 深圳前海达闼云端智能科技有限公司 Positioning method, positioning device, terminal and storage medium
CN109829947B (en) * 2019-02-25 2021-11-23 北京旷视科技有限公司 Pose determination method, tray loading method, device, medium, and electronic apparatus

Similar Documents

Publication Publication Date Title
CN110532410B (en) Camera positioning and neural network training method and device
KR20190090393A (en) Lane determining method, device and storage medium
CN106780631B (en) Robot closed-loop detection method based on deep learning
CN109829853B (en) Unmanned aerial vehicle aerial image splicing method
Ben‐Afia et al. Review and classification of vision‐based localisation techniques in unknown environments
CN107735797B (en) Method for determining a movement between a first coordinate system and a second coordinate system
Maffra et al. Real-time wide-baseline place recognition using depth completion
CN111179162B (en) Positioning initialization method under special environment and vehicle-mounted terminal
CN112179330A (en) Pose determination method and device of mobile equipment
CN111912416B (en) Method, device and equipment for positioning equipment
Yu et al. Robust robot pose estimation for challenging scenes with an RGB-D camera
CN110136058B (en) Drawing construction method based on overlook spliced drawing and vehicle-mounted terminal
CN108519102B (en) Binocular vision mileage calculation method based on secondary projection
CN107567632A (en) Critical point detection with trackability measurement result
CN109584299B (en) Positioning method, positioning device, terminal and storage medium
CN111914878B (en) Feature point tracking training method and device, electronic equipment and storage medium
CN111652929A (en) Visual feature identification and positioning method and system
Lu et al. High-performance visual odometry with two-stage local binocular BA and GPU
CN110111389B (en) Mobile augmented reality tracking registration method and system based on SLAM
CN114120301A (en) Pose determination method, device and equipment
Brosh et al. Accurate visual localization for automotive applications
CN113344195A (en) Network training and image processing method, device, equipment and storage medium
CN116469079A (en) Automatic driving BEV task learning method and related device
Xu et al. A critical analysis of image-based camera pose estimation techniques
CN113516682B (en) Loop detection method of laser SLAM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant