CN111768498A - Visual positioning method and system based on dense semantic three-dimensional map and mixed features - Google Patents

Info

Publication number: CN111768498A
Application number: CN202010654932.3A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: semantic, dense, image, dimensional, dimensional model
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Inventors: 申抒含, 时天欣, 崔海楠, 朱灵杰
Current and original assignee: Institute of Automation of Chinese Academy of Science (the listed assignees may be inaccurate)
Application filed by Institute of Automation of Chinese Academy of Science, priority to CN202010654932.3A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00: Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50: Information retrieval of still image data
    • G06F 16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583: Retrieval using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention belongs to the field of visual positioning, and particularly relates to a visual positioning method and system based on a dense semantic three-dimensional map and mixed features, aiming at solving the problem that existing visual positioning methods lose robustness and accuracy under large changes in scene appearance or photographing conditions. The method comprises the following steps: acquiring a dense three-dimensional model and a dense semantic three-dimensional model of a target scene; acquiring a plurality of candidate retrieval images for a query image; establishing matching relations between the query image and each candidate retrieval image and, through them, between the query image and the dense three-dimensional model; estimating a temporary pose from each matching relation, projecting all visible three-dimensional points with semantics onto the query image, and counting the number of three-dimensional points whose semantic labels agree with those of their two-dimensional projections on the query image as a semantic consistency score; and acquiring the final positioning information by a pose calculation method based on weighted RANSAC. The invention improves the robustness and accuracy of visual positioning.

Description

Visual positioning method and system based on dense semantic three-dimensional map and mixed features
Technical Field
The invention belongs to the field of visual positioning, and particularly relates to a visual positioning method and system based on a dense semantic three-dimensional map and mixed features.
Background
Currently, visual positioning methods can be divided into three main types: methods based on image retrieval, methods based on deep learning, and methods based on three-dimensional models. Three-dimensional-model-based methods can provide more accurate camera poses than the first two types. Although conventional three-dimensional-model-based positioning methods work well when the query image and the database images are shot under similar conditions, they often fail to position the query image accurately under large changes in scene appearance, for example when the query image and the database images are shot in different seasons, illumination, or weather. The main reason is that these methods need a large number of correct 2D-3D feature matches and therefore rely heavily on the stability of local features; traditional local features are very sensitive to changes in appearance and illumination, so scenes with a large time span produce many matching outliers, which in turn leads to visual positioning failure.
Disclosure of Invention
In order to solve the above problems in the prior art, that is, the problem that existing visual positioning methods have low robustness and accuracy under large changes in appearance or photographing conditions, a first aspect of the present invention provides a visual positioning method based on a dense semantic three-dimensional map and mixed features, the method comprising:
step S100, acquiring a dense three-dimensional model and a dense semantic three-dimensional model which are constructed based on a database picture of a target scene;
step S200, acquiring a plurality of candidate retrieval images from the database picture through an image retrieval method for the input query image;
step S300, respectively establishing the matching relation between the query image and the dense semantic three-dimensional model through the feature matching relation between the query image and each candidate retrieval image established based on multiple feature points to obtain a first matching relation set;
step S400, acquiring an initial pose of the image acquisition device corresponding to the query image under each matching relation based on the first matching relation set to obtain an initial pose set;
step S500, acquiring a first point set and a second point set based on each initial pose, and counting the number of points with consistent semantic labels in the two point sets as the semantic consistency score of the corresponding candidate retrieval image; the first point set consists of the three-dimensional points in the dense semantic three-dimensional model that are visible to the query image in the initial pose; the second point set consists of the two-dimensional projections of those three-dimensional points on the query image;
step S600, acquiring the weight of each first matching relation based on the semantic consistency score of each candidate retrieval image, and visually positioning the image acquisition device corresponding to the query image by a pose calculation method based on weighted RANSAC.
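As an illustration of the weighted-RANSAC pose computation in step S600, the sketch below normalizes semantic consistency scores into weights and draws minimal samples with probability proportional to per-match weight instead of uniformly. The `solve` and `residual` callbacks stand in for a PnP solver and a reprojection-error test; they are placeholders of this sketch, not the patent's implementation.

```python
import random

def normalized_weights(scores):
    """Turn per-candidate semantic-consistency scores into sampling weights."""
    total = sum(scores)
    if total == 0:
        return [1.0 / len(scores)] * len(scores)  # fall back to uniform
    return [s / total for s in scores]

def weighted_ransac(matches, weights, solve, residual, threshold,
                    sample_size=4, iterations=100, seed=0):
    """Generic weighted RANSAC: minimal samples are drawn with probability
    proportional to the per-match weight rather than uniformly, so matches
    coming from candidate images with higher semantic consistency are
    sampled more often."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        sample = rng.choices(matches, weights=weights, k=sample_size)
        model = solve(sample)              # e.g. a PnP minimal solver
        if model is None:
            continue
        inliers = [m for m in matches if residual(model, m) < threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers
```

In the patent's pipeline, `solve` would be a PnP solver over 2D-3D matches and `residual` the reprojection error of a match under the candidate pose.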
In some preferred embodiments, the dense three-dimensional model and the dense semantic three-dimensional model in step S100 are constructed by:
establishing a dense three-dimensional model based on database pictures of a target scene; the database picture comprises a plurality of pictures of the target scene;
and performing semantic segmentation on the database picture, and acquiring a dense semantic three-dimensional model based on the dense three-dimensional model.
In some preferred embodiments, the method for "building a dense three-dimensional model based on database pictures of a target scene" comprises:
establishing a sparse three-dimensional model through an SfM algorithm based on the database picture;
and establishing a dense three-dimensional model through an MVS algorithm based on the sparse three-dimensional model.
In some preferred embodiments, the "plurality of feature points" in step S300 includes two types of feature points: SIFT and R2D2.
In some preferred embodiments, in step S300, "the feature matching relationship between the query image and each of the candidate search images is established based on a plurality of feature points", and the method includes:
extracting the corresponding feature points from the query image and from each candidate retrieval image using each of a plurality of feature point extraction methods;
and establishing a 2D-2D feature matching relation between the query image and each candidate retrieval image based on the extracted feature points.
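The 2D-2D matching step can be sketched as a mutual nearest-neighbour search over descriptors, run once per feature type (SIFT, R2D2) with the resulting match sets combined. The brute-force search and squared Euclidean metric below are simplifications for illustration, not the patent's prescribed matcher:

```python
def match_descriptors(desc_q, desc_r):
    """Mutual-nearest-neighbour matching between query-image and
    retrieved-image descriptors (each a list of equal-length float vectors).
    Returns index pairs (i, j) such that desc_q[i] and desc_r[j] are each
    other's nearest neighbour."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    def nearest(src, dst):
        return [min(range(len(dst)), key=lambda j: dist(d, dst[j])) for d in src]
    q2r = nearest(desc_q, desc_r)
    r2q = nearest(desc_r, desc_q)
    # keep a pair only if each descriptor is the other's nearest neighbour
    return [(i, j) for i, j in enumerate(q2r) if r2q[j] == i]
```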
In some preferred embodiments, step S300 "establishing a matching relationship between the query image and the dense semantic three-dimensional model" includes:
and for each candidate retrieval image, acquiring the three-dimensional coordinates of the corresponding feature points according to the depth value of the corresponding depth map of the candidate retrieval image, and acquiring the 2D-3D matching relation between the query image and the dense semantic three-dimensional model based on the feature matching relation between the candidate retrieval image and the query image.
In some preferred embodiments, the initial pose in step S400 is obtained by:
and for each first matching relation, calculating the pose of the query image corresponding to the image acquisition device under the matching relation through a PnP algorithm.
In some preferred embodiments, in step S600, "obtaining a weight for each first matching relationship based on the semantic consistency score of each candidate search image" includes:
summing the semantic consistency scores of the candidate retrieval images to obtain a score sum;
and normalizing the semantic consistency score of each candidate retrieval image according to the score sum to obtain the weight of the corresponding candidate retrieval image.
In a second aspect of the present invention, a visual positioning system based on dense semantic three-dimensional map and mixed features is provided, the system comprising:
a first module configured to build a dense three-dimensional model based on a database picture of a target scene; the database picture comprises a plurality of pictures of the target scene;
the second module is configured to perform semantic segmentation on the database picture and acquire a dense semantic three-dimensional model based on the dense three-dimensional model;
the third module is configured to acquire a plurality of candidate retrieval images from the database pictures through an image retrieval method for the input query image;
a fourth module, configured to respectively establish a matching relationship between the query image and the dense semantic three-dimensional model through a feature matching relationship between the query image and each candidate retrieval image established based on multiple feature points, so as to obtain a set of first matching relationships;
a fifth module, configured to obtain an initial pose of the image acquisition device corresponding to the query image in each matching relationship based on the set of first matching relationships, to obtain an initial pose set;
a sixth module, configured to obtain the first point set and the second point set based on each of the initial poses, and count the number of points with consistent semantic labels in the two point sets as the semantic consistency score of the corresponding candidate retrieval image; the first point set consists of the three-dimensional points in the dense semantic three-dimensional model that are visible to the query image in the initial pose; the second point set consists of the two-dimensional projections of those three-dimensional points on the query image;
and the seventh module is configured to acquire the weight of each first matching relation based on the semantic consistency score of each candidate retrieval image, and to perform visual positioning of the image acquisition device corresponding to the query image by a pose calculation method based on weighted RANSAC.
In some preferred embodiments, the fourth module further comprises a first feature extraction module, a second feature extraction module;
the first feature extraction module is configured to extract SIFT features of the picture;
the second feature extraction module is configured to extract R2D2 features of the picture.
In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the above-mentioned visual localization method based on dense semantic three-dimensional maps and mixed features.
In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the above-described visual localization method based on dense semantic three-dimensional maps and mixed features.
The invention has the beneficial effects that:
the characteristics of manual design and the learning characteristics based on deep learning are jointly used, so that the advantages of the characteristics can be exerted under different environmental conditions, and the positioning accuracy and the robustness under various environments are further improved.
Each retrieved candidate image is assigned a semantic consistency score as a soft constraint to help pick out the more likely correct retrieved image.
The dense semantic three-dimensional model is used for replacing the sparse three-dimensional model, so that the model precision is guaranteed, the model can be adapted to all types of features, and the discrimination of semantic consistency scores can be improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a schematic diagram of a visual positioning method based on a dense semantic three-dimensional map and mixed features according to an embodiment of the invention;
FIG. 2 is a schematic illustration of different types of three-dimensional models in one embodiment of the invention;
FIG. 3 is a diagram illustrating a comparison of two search methods in a challenging scenario, according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of matching interior points of SIFT and R2D2 in six different indoor and outdoor scenarios in accordance with an embodiment of the present invention;
FIG. 5 is a schematic diagram of the visible angles and distances of three-dimensional points according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Obtaining robust and accurate visual positioning under large appearance changes or photographing condition changes is still a great challenge. Positioning methods based on a three-dimensional model depend heavily on local features: an accurate pose can be obtained with the PnP method only if a sufficient number of accurate 2D-3D matches is available. However, over a large time span, the content captured by the query image and by the database images often differs greatly in appearance, so conventional methods cannot obtain enough correct matches and thus cannot position accurately. To solve these problems, the invention provides a new visual positioning algorithm based on a dense semantic three-dimensional model and the combined use of manually designed features and deep-learning-based features. The main idea of the method comes from two observations:
the first observation is that the hand-designed features and the learned features have their own advantages and their own areas of excellence in the three-dimensional model-based localization method. The manually designed features, such as SIFT, SURF, ORB, etc., have some invariance to scale and angle and can accommodate slight lighting variations and noise effects. But these features perform poorly in the face of large changes in environmental conditions and do not produce a sufficiently good match. Conversely, under such large environmental changes, such as day and night shifts, the features based on deep learning have better performance and can produce relatively more correct matches. However, such features have disadvantages in that they rely heavily on training data, which also affects their generalization performance to the extent that it can appear more diverse on different data. Besides, for detecting and describing learning features of this type, such as SuperPoint, D2-Net, R2D2, and the like, the feature key point detection accuracy is lower than that of manually designed features, such as SIFT. This also leads to a reduction in the accuracy of visual positioning to some extent. Therefore, combining the manually designed features and the learning features is a reasonable and effective way, because the advantages of the manually designed features and the learning features can be exerted under different environmental conditions, so that the positioning accuracy and the robustness under different environments can be improved.
The second observation is that the high-level semantic information of an image is, compared with local features, a high-quality, stable, and invariant representation of the scene. Semantic information is largely unaffected by seasonal, weather, or other changes, and has begun to play an important role in visual localization. Typically, semantic constraints are measured by projecting the three-dimensional points of a model onto the query image and then checking whether the semantic label of each three-dimensional point is consistent with that of its two-dimensional projection on the query image. Under this measure, the more points the three-dimensional model contains and the more accurate the model is, the more discriminative the semantic consistency score becomes.
Based on the two observations, the invention provides a method for using a dense semantic three-dimensional model and integrating mixed features into a three-dimensional model-based visual positioning. The use of a dense three-dimensional model has two advantages, the first being that different types of features can be tested using the same three-dimensional model, without the need to rebuild a new three-dimensional model each time for different types of features. The second advantage is that the dense semantic three-dimensional model can make the semantic consistency score more discriminative than the sparse three-dimensional model so as to assist in screening out more likely correct retrieval candidate images.
The invention discloses a visual positioning method based on a dense semantic three-dimensional map and mixed features, which comprises the following steps:
step S100, acquiring a dense three-dimensional model and a dense semantic three-dimensional model which are constructed based on a database picture of a target scene;
step S200, acquiring a plurality of candidate retrieval images from the database picture through an image retrieval method for the input query image;
step S300, respectively establishing the matching relation between the query image and the dense semantic three-dimensional model through the feature matching relation between the query image and each candidate retrieval image established based on multiple feature points to obtain a first matching relation set;
step S400, acquiring an initial pose of the image acquisition device corresponding to the query image under each matching relation based on the first matching relation set to obtain an initial pose set;
step S500, acquiring a first point set and a second point set based on each initial pose, and counting the number of points with consistent semantic labels in the two point sets as the semantic consistency score of the corresponding candidate retrieval image; the first point set consists of the three-dimensional points in the dense semantic three-dimensional model that are visible to the query image in the initial pose; the second point set consists of the two-dimensional projections of those three-dimensional points on the query image;
step S600, acquiring the weight of each first matching relation based on the semantic consistency score of each candidate retrieval image, and visually positioning the image acquisition device corresponding to the query image by a pose calculation method based on weighted RANSAC.
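Step S500 can be sketched as follows. Note that visibility is reduced here to "positive depth and inside the frame", whereas the patent additionally considers the visible angles and distances of three-dimensional points (cf. fig. 5):

```python
def semantic_consistency_score(points, labels, seg, K, R, t):
    """Project labelled 3D points into the query image under pose (R, t) and
    count how many land on a pixel whose predicted semantic class equals the
    point's class. `seg` is the query image's semantic segmentation given as
    a 2D list of class ids; K = (fx, fy, cx, cy)."""
    fx, fy, cx, cy = K
    h, w = len(seg), len(seg[0])
    score = 0
    for X, lab in zip(points, labels):
        xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
        if xc[2] <= 0:          # behind the camera -> not visible
            continue
        u = int(round(fx * xc[0] / xc[2] + cx))
        v = int(round(fy * xc[1] / xc[2] + cy))
        if 0 <= u < w and 0 <= v < h and seg[v][u] == lab:
            score += 1
    return score
```

Normalizing these scores over all candidate retrieval images then yields the per-candidate weights used in step S600.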
In order to more clearly describe the visual positioning method based on the dense semantic three-dimensional map and the mixed features, the following will expand the detailed description of the steps in one embodiment of the method in conjunction with the accompanying drawings.
As shown in fig. 1, in the visual positioning method based on the dense semantic three-dimensional map and the mixed features according to the embodiment of the present invention, the positioning information of the acquisition device corresponding to the query image is obtained through the online operations of step S100 to step S600.
And S100, acquiring a dense three-dimensional model and a dense semantic three-dimensional model which are constructed based on the database picture of the target scene.
In this embodiment, the dense semantic three-dimensional model may be pre-constructed off-line. The database picture is a collection consisting of a large number of photographs of the target scene. The construction method of the dense three-dimensional model and the dense semantic three-dimensional model comprises: establishing a dense three-dimensional model based on the database pictures of the target scene, the database picture comprising a plurality of pictures of the target scene; and performing semantic segmentation on the database pictures and acquiring a dense semantic three-dimensional model based on the dense three-dimensional model.
The step of establishing the dense three-dimensional model specifically comprises: establishing a sparse three-dimensional model through an SfM algorithm based on the database picture; and establishing a dense three-dimensional model through an MVS algorithm based on the sparse three-dimensional model.
In this embodiment, in the process of establishing the dense three-dimensional model, the MVS algorithm is executed on the calibrated images provided by the SfM result to generate a depth map for each image and to obtain the fused dense point cloud.
Among the various geometry-based and deep-learning-based MVS methods, PatchMatch-based MVS performs best on the well-known MVS evaluation benchmarks. The present invention therefore uses a mature PatchMatch-based MVS algorithm to generate the required dense three-dimensional model. The algorithm flow covers neighbor-image selection, propagation-based depth map computation, depth map filtering, and depth map fusion.
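The depth-map-filtering stage of this flow can be illustrated by a simplified consistency vote: a per-pixel depth survives only if enough neighbour views agree on it. Real MVS filtering reprojects depths between views rather than comparing pixel-aligned values, so the alignment assumed below is a deliberate simplification:

```python
def filter_depth(depth, neighbor_depths, rel_tol=0.01, min_consistent=2):
    """Keep a per-pixel depth only if at least `min_consistent` neighbouring
    views report a depth within a relative tolerance of it; otherwise zero
    it out. Depth maps are 2D lists; 0 marks an invalid depth."""
    h, w = len(depth), len(depth[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            d = depth[y][x]
            if d <= 0:
                continue
            votes = sum(
                1 for nd in neighbor_depths
                if nd[y][x] > 0 and abs(nd[y][x] - d) <= rel_tol * d
            )
            if votes >= min_consistent:
                out[y][x] = d
    return out
```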
Each three-dimensional point in the dense three-dimensional model performs majority voting over the semantic label categories of its two-dimensional projections in all images where it is visible, and the category with the most votes is assigned to the point as its final semantic category. According to these semantic categories, three-dimensional points whose category belongs to a dynamic object can be removed from the model, for example points on pedestrians, automobiles, buses, and the sky, which are useless or even counterproductive for visual positioning, yielding a cleaner dense semantic three-dimensional model (refer to fig. 2, schematic diagrams of different types of three-dimensional models, where the four pictures are taken from the same view angle). Compared with a sparse semantic three-dimensional model, the dense model has more three-dimensional points that can participate in the measurement of semantic consistency, so the semantic consistency score is more discriminative. It should be noted that PatchMatch-based MVS is a typical depth-map-fusion MVS method, that is, the depth map of each picture is computed during the MVS process. Therefore, for each feature point in an image, whether a learned feature or a manually designed feature, the corresponding three-dimensional point can be obtained jointly from the camera's intrinsic and extrinsic parameters and the depth (if any) at the corresponding position in the depth map.
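The majority vote and dynamic-object removal just described can be sketched directly; the class names in `DYNAMIC_CLASSES` follow the examples in the text (pedestrians, automobiles, buses, sky) and are otherwise an assumption of this sketch:

```python
from collections import Counter

DYNAMIC_CLASSES = {"person", "car", "bus", "sky"}  # classes removed from the map

def label_and_filter(points, observations):
    """Assign each 3D point the majority semantic label over the labels of
    its 2D projections in all images where it is visible, then drop points
    whose winning class belongs to a dynamic object."""
    kept = []
    for pt, obs in zip(points, observations):
        label, _ = Counter(obs).most_common(1)[0]  # majority vote
        if label not in DYNAMIC_CLASSES:
            kept.append((pt, label))
    return kept
```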
Step S200, for the input query image, a plurality of candidate retrieval images are obtained from the database picture through an image retrieval method.
After the scene's three-dimensional model and the query pictures are obtained, an image retrieval technique is first used to find, for each query picture, a certain number of the most similar database pictures. To select a suitable retrieval method, two well-known approaches were compared: the NetVLAD method and the vocabulary-tree-based retrieval method provided by COLMAP. NetVLAD learns an image representation end-to-end with a convolutional neural network, while the other method obtains an image representation using a vocabulary tree followed by spatial re-ranking. Experiments show that NetVLAD performs better under large changes in lighting conditions, especially day-night changes, but often retrieves incorrectly when faced with the similar or symmetric structures common in indoor scenes (see fig. 3, a comparison of the two retrieval methods in challenging scenes; comparing the first row with the third shows that NetVLAD outperforms the vocabulary-tree-based method at night, whereas comparing the second and fourth rows shows that the vocabulary-tree-based method is better at handling retrieval in indoor scenes with symmetric or similar structures). As the first and third rows of fig. 3 show, NetVLAD retrieves a correct image among the top 5 results under day-night changes, whereas the vocabulary-tree-based method does not. As the second row of fig. 3 shows, all of NetVLAD's top 5 results are retrieval errors: the first, second, and fifth pictures are structurally symmetric to the query image, and the third and fourth are structurally similar to it but are not physically the same place.
The main reason for this phenomenon may be that NetVLAD learns high-level structural information of the image, and its backbone network weights are obtained by training on the ImageNet dataset, which involves heavy data augmentation, i.e., operations such as flipping and mirroring images to diversify the training data and improve the model's generalization. For these reasons, images with similar or symmetric structures obtain essentially identical feature representations after the convolutional layers, which makes them easy to confuse during retrieval.
Therefore, to increase the chance of retrieving the correct picture, an intuitive approach is to use both NetVLAD and the vocabulary-tree-based method, so that a correct picture is retrieved in as many circumstances as possible. However, practical tests show that in most cases mixing the two retrieval methods lowers the proportion of correctly retrieved images compared with using NetVLAD alone, which directly reduces the number of matching inliers and the positioning accuracy. The invention therefore adopts NetVLAD as the image retrieval algorithm in the overall pipeline.
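Once a global descriptor such as a NetVLAD vector is available per image, the retrieval step itself reduces to a top-k ranking by descriptor similarity. A minimal sketch using cosine similarity (the descriptor contents and the choice of metric are assumptions; the real NetVLAD pipeline compares its own learned, normalized vectors):

```python
import math

def top_k_retrieval(query_desc, db_descs, k=5):
    """Rank database images by cosine similarity of their global descriptors
    to the query descriptor and return the indices of the top-k candidates."""
    def cosine(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return num / den if den else 0.0
    ranked = sorted(range(len(db_descs)),
                    key=lambda i: cosine(query_desc, db_descs[i]),
                    reverse=True)
    return ranked[:k]
```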
Step S300, respectively establishing the matching relation between the query image and the dense semantic three-dimensional model through the feature matching relation between the query image and each candidate retrieval image established based on the multiple feature points to obtain a first matching relation set.
(1) Feature selection
In practical applications, visual positioning needs to be robust enough to cope with various changes in environmental conditions, such as changes in illumination, weather, season, even day and night, as well as changes in shooting view angle. To date, many approaches still extract feature keypoints for three-dimensional reconstruction and visual localization using typical hand-designed features such as SIFT. The SIFT descriptor aggregates high-dimensional vectors of image gradients over local regions of the image; it is unaffected by image scaling and rotation and provides robust matching in most conventional cases. However, when the scene's environmental conditions change greatly, the low-level image information used by SIFT is easily affected by the appearance change, so SIFT keypoint detection is no longer stable; this may cause feature matching to fail and, in turn, the query image to be positioned incorrectly.
In recent years, with the rapid development of convolutional-neural-network-based features, learning-based features have begun to exhibit better visual localization performance than traditional hand-designed features under challenging environmental conditions. These learning-based features use deep neural networks, in a data-driven manner, to learn how to extract feature keypoints or describe features. Compared with SIFT features, which consider only small local regions of the image, learning-based features can exploit more image information, such as color, larger image regions, higher-level image structure, and scene layout, and therefore match better than SIFT under challenging imaging conditions.
Next, which learning-based feature to select is the primary problem to solve. The embodiment of the invention takes three well-performing learned features, D2-Net, R2D2, and SuperPoint, as candidate features; all three are open source and ranked at the top of the local feature challenge leaderboard. To select the better learned feature, SIFT and the three learned features are each tested with the method of the invention on the Aachen Day-Night dataset of the long-term visual localization benchmark. This test dataset uses a three-dimensional model constructed only from daytime database images, but requires localizing both daytime and nighttime query images. The experimental results show that, with the same dense semantic three-dimensional model, SIFT has higher positioning accuracy for daytime query images while R2D2 performs better at night. In practice, however, it is hard to draw a clear boundary between day and night by criteria such as illumination intensity or time of day, so an effective way is to combine SIFT and R2D2 to cope with different environmental conditions. The experimental results also show that the combined use does not cancel out their respective advantages, and achieves the highest level of localization performance on the dataset. In addition, it has been found experimentally that the spatial distribution of the mixed feature matching points on the query image is much broader than the distribution obtained using either feature alone (refer to FIG. 4, the matching inlier diagrams of SIFT and R2D2 in six different indoor and outdoor scenes). A broader distribution of features increases the probability of obtaining a sufficient number of correct 2D-3D matches.
This also indicates that these two types of features can play a complementary role, and also demonstrates that it is reasonable and efficient to mix the two features.
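As a non-limiting sketch of how the two complementary match sets can be pooled (the match representation and rounding rule below are illustrative assumptions, not the claimed implementation):

```python
def merge_matches(sift_matches, learned_matches):
    """Combine 2D-2D matches from two feature types into one pool.
    Each match is ((x_q, y_q), (x_r, y_r)) in query/retrieval pixel
    coordinates; matches landing on the same (rounded) query pixel
    are kept only once so one feature type does not double-count."""
    seen, merged = set(), []
    for m in list(sift_matches) + list(learned_matches):
        key = (round(m[0][0]), round(m[0][1]))  # dedupe by query pixel
        if key not in seen:
            seen.add(key)
            merged.append(m)
    return merged

sift = [((10.2, 20.1), (11, 21)), ((30.0, 40.0), (31, 41))]
learned = [((10.4, 20.3), (12, 22)), ((50.0, 60.0), (51, 61))]
merged = merge_matches(sift, learned)
print(len(merged))  # -> 3 (the near-duplicate query pixel is merged)
```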
(2) Obtaining of first matching relationship
Extracting a plurality of corresponding feature points from the query image and each candidate retrieval image respectively by a plurality of feature point extraction methods; establishing a 2D-2D feature matching relationship between the query image and each candidate retrieval image based on the extracted feature points; and for each candidate retrieval image, acquiring the three-dimensional coordinates of the corresponding feature points according to the depth values of the depth map corresponding to the candidate retrieval image, and acquiring the 2D-3D matching relationship between the query image and the dense semantic three-dimensional model based on the feature matching relationship between the candidate retrieval image and the query image.
In this embodiment, after a series of candidate retrieval images are obtained, semantic segmentation is performed on each query image, and feature matching is established between the query image and each retrieval image using the two features SIFT and R2D2 as mixed features. Only one retrieval image is used at a time: first the 2D-2D feature matching between the query image and the retrieval image is computed; then, according to the coordinates of the matching points on the retrieval image, depth values are read from the depth map corresponding to the retrieval image (an intermediate product of the MVS pipeline) and the corresponding three-dimensional point coordinates are computed, yielding the 2D-3D matching relationship between the query image and the dense three-dimensional model.
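The lifting of a matched retrieval-image pixel to a model point via its depth value can be sketched as follows (illustrative only; a pinhole model with a world-to-camera pose (R, t) is assumed, and the database-image intrinsics/poses would come from the SfM reconstruction):

```python
import numpy as np

def backproject(u, v, depth, K, R, t):
    """Lift a pixel of a retrieval image to a 3-D model point.
    K: 3x3 intrinsics; (R, t): world-to-camera pose; depth: the MVS
    depth value at (u, v). Returns the point in the dense model's frame."""
    x_cam = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))  # camera frame
    return R.T @ (x_cam - t)                                     # world frame

# toy check: identity pose, principal point (320, 240), focal length 500
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])
X = backproject(320, 240, 2.0, K, np.eye(3), np.zeros(3))
print(X)  # -> [0. 0. 2.]  (the principal ray, 2 units deep)
```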
And S400, acquiring the initial pose of the image acquisition device corresponding to the query image under each matching relation based on the first matching relation set to obtain an initial pose set.
The initial pose acquisition method comprises the following steps: and for each first matching relation, calculating the pose of the query image corresponding to the image acquisition device under the matching relation through a PnP algorithm.
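For illustration, a linear PnP variant (DLT on the 3x4 projection matrix) is sketched below as a stand-in for the minimal P3P/EPnP solvers typically run inside RANSAC; this is not the claimed solver, only a self-contained demonstration of recovering a pose from 2D-3D matches:

```python
import numpy as np

def dlt_pnp(pts3d, pts2d):
    """Linear PnP (DLT): estimate the 3x4 projection matrix P from
    n >= 6 2D-3D correspondences by solving A p = 0 via SVD."""
    A = []
    for (X, Y, Z), (u, v) in zip(pts3d, pts2d):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u*X, -u*Y, -u*Z, -u])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v*X, -v*Y, -v*Z, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    return Vt[-1].reshape(3, 4)          # null vector = P up to scale

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# synthetic check: recover a known camera and verify reprojection
P_true = np.hstack([np.eye(3), [[0.1], [0.2], [3.0]]])
pts3d = np.random.RandomState(0).uniform(-1, 1, (8, 3))
pts2d = np.array([project(P_true, X) for X in pts3d])
P_est = dlt_pnp(pts3d, pts2d)
err = max(np.linalg.norm(project(P_est, X) - x) for X, x in zip(pts3d, pts2d))
print(err < 1e-6)  # -> True
```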
In this embodiment, based on the 2D-3D matching relationship between the query image and the dense three-dimensional model, a temporary pose (i.e., an initial pose) of the query image can be restored by executing the PnP algorithm. And then, projecting all three-dimensional points in the model which can be seen under the current pose onto the query image according to the estimated temporary pose, and counting the number of the semantic tags of the three-dimensional points which are consistent with the semantic tags of the two-dimensional projection points of the three-dimensional points on the query image. Before projecting the three-dimensional points, the three-dimensional points which meet the conditions and can be seen by the current temporary pose need to be screened out. The visible three-dimensional points should satisfy the following two constraints:
d_min < ‖v‖ < d_max,   ∠(v, v_m) < θ    (1)

v = X − C_Q    (2)
wherein C_Q represents the coordinates of the camera optical center of the temporary pose of the query image, X represents the coordinates of the three-dimensional point, d_min represents the minimum distance between the three-dimensional point and all camera optical centers that can see it, and d_max represents the maximum distance between the three-dimensional point and all camera optical centers that can see it. θ denotes the included angle between the two most marginal lines of sight v_l and v_u. The specific physical meanings of these variables are shown in FIG. 5, where X represents a three-dimensional point, C_1, C_2, C_3 represent three cameras able to see this three-dimensional point, v_l and v_u respectively represent the lines of sight with the minimum and maximum viewing distances of the three-dimensional point X, and θ represents the viewing angle of X. These two constraints mean that a three-dimensional point is considered visible from the temporary pose of the current query image, and can be used for projection, only if the query image and the database images can see it from similar angles and distances.
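The two visibility constraints of equation (1) can be sketched as a simple predicate (illustrative only; v_mean below is assumed to be the mean viewing direction of the database cameras observing the point, and d_min, d_max, theta would be precomputed per point from the reconstruction):

```python
import numpy as np

def is_visible(X, C_q, d_min, d_max, v_mean, theta):
    """Check the two constraints of Eq. (1): the query camera must see
    point X from a distance and an angle similar to those of the
    database cameras that originally observed it."""
    v = X - C_q                          # line of sight, v = X - C_Q
    d = np.linalg.norm(v)
    if not (d_min < d < d_max):          # distance constraint
        return False
    cos_ang = v @ v_mean / (d * np.linalg.norm(v_mean))
    return np.arccos(np.clip(cos_ang, -1.0, 1.0)) < theta  # angle constraint

visible = is_visible(np.array([0.0, 0.0, 2.0]), np.zeros(3),
                     1.0, 3.0, np.array([0.0, 0.0, 1.0]), 0.5)
too_far = is_visible(np.array([0.0, 0.0, 5.0]), np.zeros(3),
                     1.0, 3.0, np.array([0.0, 0.0, 1.0]), 0.5)
print(visible, too_far)  # -> True False
```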
Step S500, acquiring a first point set and a second point set based on each initial pose, and counting the number of points with consistent semantic labels in the two point sets to serve as the semantic consistency score of the corresponding candidate retrieval image; the first point set is the three-dimensional points in the dense semantic three-dimensional model that are viewable by the query image in the initial pose; the second point set is the two-dimensional projection points of the three-dimensional points on the query image.
After all visible three-dimensional points are projected, the number of three-dimensional points whose semantic category is consistent with that of their two-dimensional projections on the query image is counted, and this count is used as the semantic consistency score of the retrieval picture. Intuitively, if the pose of the query image is computed from a wrong retrieval image, the computed pose is also wrong, so the semantic score of that retrieval image is low, while a correct retrieval image has a high semantic consistency score. The semantic score therefore measures, to a certain degree, whether the retrieval image is correct. However, since the number of projectable three-dimensional points differs greatly between scenes, no fixed semantic consistency score threshold can be found that separates correct from incorrect retrieval images. Therefore, the invention places semantic consistency in the final pose estimation process as a soft constraint, so that feature matches generated by correct retrieval images are selected with higher probability, thereby increasing the number of matching inliers and improving the positioning accuracy and success rate.
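The score computation described above can be sketched as follows (illustrative only; the labels are integer class IDs from the semantic segmentation, and the toy camera below is an assumption, not the claimed setup):

```python
import numpy as np

def semantic_score(points3d, labels3d, P, query_seg):
    """Count projected 3-D points whose semantic label agrees with the
    query image's segmentation at the projected pixel; this count is the
    retrieval image's semantic-consistency score."""
    h, w = query_seg.shape
    score = 0
    for X, lab in zip(points3d, labels3d):
        x = P @ np.append(X, 1.0)
        if x[2] <= 0:
            continue                     # behind the camera
        u, v = int(round(x[0] / x[2])), int(round(x[1] / x[2]))
        if 0 <= v < h and 0 <= u < w and query_seg[v, u] == lab:
            score += 1
    return score

K = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]])
P = np.hstack([K, np.zeros((3, 1))])     # toy camera at the model origin
points = [np.array([0.0, 0.0, 1.0]), np.array([0.5, 0.0, 1.0])]
labels = [7, 3]                          # second point's label will not match
seg = np.full((3, 3), 7)                 # query segmentation: all class 7
print(semantic_score(points, labels, P, seg))  # -> 1
```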
And S600, acquiring the weight of each first matching relation based on the semantic consistency score of each candidate retrieval image, and visually positioning the image acquisition device corresponding to the query image by a pose calculation method based on the weight RANSAC.
After each retrieval image is assigned a semantic consistency score, all the 2D-3D matching feature points corresponding to that image are given the same semantic score as the retrieval image. Finally, all 2D-3D matching feature points with semantic scores, generated by all retrieval images, are put into a weight-based RANSAC pose estimation process. Specifically, the semantic scores of all 2D-3D matches are first summed, and the semantic score of each pair of 2D-3D matches is then normalized by this sum into a weight w, i.e., each pair of 2D-3D matches is selected in the weighted RANSAC process with its corresponding probability w. This means that 2D-3D matches produced by a correctly retrieved image are sampled with higher probability, while those produced by a wrongly retrieved image are sampled with lower probability. Compared with directly removing the 2D-3D matches with low semantic scores, using them as a soft constraint both increases the matching inliers and copes robustly with semantic ambiguity.
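The weighted sampling step of this RANSAC variant can be sketched as follows (illustrative only; the hypothesis/inlier-counting stages around it are omitted, and the toy scores are assumptions):

```python
import numpy as np

def weighted_sample(match_scores, n, rng):
    """Draw a minimal set of 2D-3D matches for one RANSAC iteration,
    each match picked with probability proportional to the semantic
    score inherited from its retrieval image (the soft constraint)."""
    w = np.asarray(match_scores, dtype=float)
    w = w / w.sum()                      # normalize scores to weights w
    return rng.choice(len(w), size=n, replace=False, p=w)

rng = np.random.default_rng(0)
# matches from high-scoring retrieval images are drawn far more often
scores = [90, 90, 90, 1, 1]
counts = np.zeros(5)
for _ in range(1000):
    counts[weighted_sample(scores, 3, rng)] += 1
print(counts[:3].sum() > counts[3:].sum())  # -> True
```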
A visual positioning system based on a dense semantic three-dimensional map and mixed features according to a second embodiment of the present invention includes:
a first module configured to build a dense three-dimensional model based on a database picture of a target scene; the database picture comprises a plurality of pictures of the target scene;
the second module is configured to perform semantic segmentation on the database picture and acquire a dense semantic three-dimensional model based on the dense three-dimensional model;
the third module is configured to acquire a plurality of candidate retrieval images from the database pictures through an image retrieval method for the input query image;
a fourth module, configured to respectively establish a matching relationship between the query image and the dense semantic three-dimensional model through a feature matching relationship between the query image and each candidate retrieval image established based on multiple feature points, so as to obtain a set of first matching relationships;
a fifth module, configured to obtain an initial pose of the image acquisition device corresponding to the query image in each matching relationship based on the set of first matching relationships, to obtain an initial pose set;
a sixth module, configured to obtain the first point set and the second point set based on each of the initial poses, and count the number of points with consistent semantic labels in the two point sets, as semantic consistency scores corresponding to the candidate retrieval images; the first set of points are three-dimensional points in the dense semantic three-dimensional model that are viewable by the query image in the initial pose; the second point set is a two-dimensional projection point of the three-dimensional point on the query image;
and the seventh module is configured to acquire the weight of each first matching relation based on the semantic consistency score of each candidate retrieval image, and perform visual positioning on the image acquisition device corresponding to the query image by a pose calculation method based on weight RANSAC.
In this embodiment, the fourth module further includes a first feature extraction module and a second feature extraction module; the first feature extraction module is configured to extract SIFT features of the picture; and the second feature extraction module is configured to extract R2D2 features of the picture.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.
It should be noted that, the visual positioning system based on the dense semantic three-dimensional map and the mixed features provided in the foregoing embodiment is only illustrated by the division of the foregoing functional modules, and in practical applications, the foregoing function allocation may be completed by different functional modules according to needs, that is, the modules or steps in the embodiments of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiments may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
A storage device according to a third embodiment of the invention has stored therein a plurality of programs adapted to be loaded and executed by a processor to implement the above-described visual localization method based on dense semantic three-dimensional maps and hybrid features.
A processing apparatus according to a fourth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the above-described visual localization method based on dense semantic three-dimensional maps and mixed features.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium. The computer program, when executed by a Central Processing Unit (CPU), performs the above-described functions defined in the method of the present application. It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (12)

1. A visual positioning method based on dense semantic three-dimensional map and mixed features is characterized by comprising the following steps:
s100, acquiring a dense three-dimensional model and a dense semantic three-dimensional model which are constructed based on a database picture of a target scene;
step S200, acquiring a plurality of candidate retrieval images from the database picture through an image retrieval method for the input query image;
step S300, respectively establishing the matching relation between the query image and the dense three-dimensional model through the feature matching relation between the query image and each candidate retrieval image established based on multiple feature points to obtain a first matching relation set;
step S400, acquiring an initial pose of the image acquisition device corresponding to the query image under each matching relation based on the first matching relation set to obtain an initial pose set;
step S500, acquiring a first point set and a second point set based on each initial pose, and counting the number of points with consistent semantic labels in the two point sets to serve as semantic consistency scores of corresponding candidate retrieval images; the first set of points are three-dimensional points in the dense semantic three-dimensional model that are viewable by the query image in the initial pose; the second point set is a two-dimensional projection point of the three-dimensional point on the query image;
and S600, acquiring the weight of each first matching relation based on the semantic consistency score of each candidate retrieval image, and visually positioning the image acquisition device corresponding to the query image by a pose calculation method based on the weight RANSAC.
2. The visual positioning method based on the dense semantic three-dimensional map and the mixed features as claimed in claim 1, wherein the dense three-dimensional model and the dense semantic three-dimensional model in step S100 are constructed by:
establishing a dense three-dimensional model based on database pictures of a target scene; the database picture comprises a plurality of pictures of the target scene;
and performing semantic segmentation on the database picture, and acquiring a dense semantic three-dimensional model based on the dense three-dimensional model.
3. The visual positioning method based on the dense semantic three-dimensional map and the mixed features as claimed in claim 1, wherein the method of "establishing the dense three-dimensional model based on the database picture of the target scene" comprises the following steps:
establishing a sparse three-dimensional model through an SfM algorithm based on the database picture;
and establishing a dense three-dimensional model through an MVS algorithm based on the sparse three-dimensional model.
4. The visual positioning method based on the dense semantic three-dimensional map and the mixed features as claimed in claim 1, wherein the "multiple feature points" in step S300 include two feature points of SIFT and R2D 2.
5. The visual positioning method based on the dense semantic three-dimensional map and the mixed features as claimed in claim 1, wherein the step S300 "feature matching relationship between the query image and each candidate search image established based on a plurality of feature points" comprises:
extracting a plurality of corresponding characteristic points from the query image and each candidate retrieval image respectively by a plurality of characteristic point extraction methods;
and establishing a 2D-2D feature matching relation between the query image and each candidate retrieval image based on the extracted feature points.
6. The visual positioning method based on the dense semantic three-dimensional map and the mixed features as claimed in claim 5, wherein the step S300 of establishing the matching relationship between the query image and the dense semantic three-dimensional model comprises the following steps:
and for each candidate retrieval image, acquiring three-dimensional coordinates of corresponding feature points according to the depth value of the corresponding depth map of the candidate retrieval image, and acquiring a 2D-3D matching relation between the query image and the dense semantic three-dimensional model based on the feature matching relation between the candidate retrieval image and the query image.
7. The visual positioning method based on the dense semantic three-dimensional map and the mixed features as claimed in claim 1, wherein the initial pose in step S400 is obtained by:
and for each first matching relation, calculating the pose of the query image corresponding to the image acquisition device under the matching relation through a PnP algorithm.
8. The visual positioning method based on the dense semantic three-dimensional map and the mixed features as claimed in claim 1, wherein in step S600, "obtaining the weight of each first matching relationship based on the semantic consistency score of each candidate retrieval image" comprises:
summing the semantic consistency scores of the candidate retrieval images to obtain a score sum;
and normalizing the semantic consistency score of each candidate retrieval image according to the score sum to obtain the weight of the corresponding candidate retrieval image.
9. A visual positioning system based on dense semantic three-dimensional maps and hybrid features, the system comprising:
a first module configured to build a dense three-dimensional model based on a database picture of a target scene; the database picture comprises a plurality of pictures of the target scene;
the second module is configured to perform semantic segmentation on the database picture and acquire a dense semantic three-dimensional model based on the dense three-dimensional model;
the third module is configured to acquire a plurality of candidate retrieval images from the database pictures through an image retrieval method for the input query image;
a fourth module, configured to respectively establish a matching relationship between the query image and the dense semantic three-dimensional model through a feature matching relationship between the query image and each candidate retrieval image established based on multiple feature points, so as to obtain a set of first matching relationships;
a fifth module, configured to obtain an initial pose of the image acquisition device corresponding to the query image in each matching relationship based on the set of first matching relationships, to obtain an initial pose set;
a sixth module, configured to obtain the first point set and the second point set based on each of the initial poses, and count the number of points with consistent semantic labels in the two point sets, as semantic consistency scores corresponding to the candidate retrieval images; the first set of points are three-dimensional points in the dense semantic three-dimensional model that are viewable by the query image in the initial pose; the second point set is a two-dimensional projection point of the three-dimensional point on the query image;
and the seventh module is configured to acquire the weight of each first matching relation based on the semantic consistency score of each candidate retrieval image, and perform visual positioning on the image acquisition device corresponding to the query image by a pose calculation method based on weight RANSAC.
10. The dense semantic three-dimensional map and hybrid feature based visual positioning system of claim 9, wherein the fourth module further comprises a first feature extraction module, a second feature extraction module;
the first feature extraction module is configured to extract SIFT features of the picture;
the second feature extraction module is configured to extract R2D2 features of the picture.
11. A storage device having stored therein a plurality of programs, wherein said programs are adapted to be loaded and executed by a processor to implement the method of visual localization based on dense semantic three-dimensional maps and mixed features of any one of claims 1 to 8.
12. A processing device comprising a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; characterized in that said program is adapted to be loaded and executed by a processor to implement a visual localization method based on dense semantic three-dimensional maps and mixed features according to any of claims 1 to 8.
CN202010654932.3A 2020-07-09 2020-07-09 Visual positioning method and system based on dense semantic three-dimensional map and mixed features Pending CN111768498A (en)

Publications (1)

Publication Number Publication Date
CN111768498A true CN111768498A (en) 2020-10-13

Family

ID=72725537

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010654932.3A Pending CN111768498A (en) 2020-07-09 2020-07-09 Visual positioning method and system based on dense semantic three-dimensional map and mixed features

Country Status (1)

Country Link
CN (1) CN111768498A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112562081A (en) * 2021-02-07 2021-03-26 之江实验室 Visual map construction method for visual layered positioning
CN113034596A (en) * 2021-03-26 2021-06-25 浙江大学 Three-dimensional object detection and tracking method
CN113140022A (en) * 2020-12-25 2021-07-20 杭州今奥信息科技股份有限公司 Digital mapping method, system and computer readable storage medium
CN113393515A (en) * 2021-05-21 2021-09-14 杭州易现先进科技有限公司 Visual positioning method and system combined with scene labeling information
CN113487741A (en) * 2021-06-01 2021-10-08 中国科学院自动化研究所 Dense three-dimensional map updating method and device
WO2023179523A1 (en) * 2022-03-22 2023-09-28 北京字跳网络技术有限公司 Positioning method and apparatus, and electronic device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097553A (en) * 2019-04-10 2019-08-06 东南大学 The semanteme for building figure and three-dimensional semantic segmentation based on instant positioning builds drawing system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
TIANXIN SHI et al.: "Dense Semantic 3D Map Based Long-Term Visual Localization with Hybrid Features", arXiv:2005.10766v1 [cs.CV] *
TIANXIN SHI: "Visual Localization Using Sparse Semantic 3D Map", 2019 IEEE International Conference on Image Processing (ICIP) *
YURUN TIAN et al.: "D2D: Keypoint Extraction with Describe to Detect Approach", arXiv:2005.13605 [cs.CV] *
ZHOU Yan: "Research on Multi-View Indoor 3D Model Reconstruction Fusing Scene Semantic Information", China Doctoral Dissertations Full-text Database, Information Science and Technology Series (Monthly) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113140022A (en) * 2020-12-25 2021-07-20 杭州今奥信息科技股份有限公司 Digital mapping method, system and computer readable storage medium
CN112562081A (en) * 2021-02-07 2021-03-26 之江实验室 Visual map construction method for visual layered positioning
CN112562081B (en) * 2021-02-07 2021-05-11 之江实验室 Visual map construction method for visual layered positioning
CN113034596A (en) * 2021-03-26 2021-06-25 浙江大学 Three-dimensional object detection and tracking method
CN113034596B (en) * 2021-03-26 2022-05-13 浙江大学 Three-dimensional object detection and tracking method
CN113393515A (en) * 2021-05-21 2021-09-14 杭州易现先进科技有限公司 Visual positioning method and system combined with scene labeling information
CN113393515B (en) * 2021-05-21 2023-09-19 杭州易现先进科技有限公司 Visual positioning method and system combining scene annotation information
CN113487741A (en) * 2021-06-01 2021-10-08 中国科学院自动化研究所 Dense three-dimensional map updating method and device
CN113487741B (en) * 2021-06-01 2024-05-28 中国科学院自动化研究所 Dense three-dimensional map updating method and device
WO2023179523A1 (en) * 2022-03-22 2023-09-28 北京字跳网络技术有限公司 Positioning method and apparatus, and electronic device

Similar Documents

Publication Publication Date Title
CN111768498A (en) Visual positioning method and system based on dense semantic three-dimensional map and mixed features
Ardeshir et al. GIS-assisted object detection and geospatial localization
Majdik et al. Air‐ground matching: Appearance‐based GPS‐denied urban localization of micro aerial vehicles
US8798357B2 (en) Image-based localization
US10909369B2 (en) Imaging system and method for object detection and localization
CN108920580A (en) Image matching method, device, storage medium and terminal
CN105608417B (en) Traffic lights detection method and device
Yu et al. Robust robot pose estimation for challenging scenes with an RGB-D camera
CN110544268B (en) Multi-target tracking method based on structured light and SiamMask network
CN111738036B (en) Image processing method, device, equipment and storage medium
CN111652929A (en) Visual feature identification and positioning method and system
CN110516707B (en) Image labeling method and device and storage medium thereof
Son et al. A multi-vision sensor-based fast localization system with image matching for challenging outdoor environments
Wang et al. Combining semantic scene priors and haze removal for single image depth estimation
CN109916415A (en) Road type determines method, apparatus, equipment and storage medium
Xiao et al. Geo-spatial aerial video processing for scene understanding and object tracking
CN113822996B (en) Pose estimation method and device for robot, electronic device and storage medium
CN111931782A (en) Semantic segmentation method, system, medium, and apparatus
Ji et al. An evaluation of conventional and deep learning‐based image‐matching methods on diverse datasets
CN111583332A (en) Visual positioning method, system and device based on parallel search 2D-3D matching
Shi et al. Dense semantic 3D map based long-term visual localization with hybrid features
CN116778347A (en) Data updating method, device, electronic equipment and storage medium
CN116596971A (en) Dual detection method and system for aerial moving target and storage medium
Sujiwo et al. Robust and accurate monocular vision-based localization in outdoor environments of real-world robot challenge
CN111488771B (en) OCR hooking method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201013