WO2018235923A1 - Position estimating device, position estimating method, and program - Google Patents

Position estimating device, position estimating method, and program Download PDF

Info

Publication number
WO2018235923A1
Authority
WO
WIPO (PCT)
Prior art keywords
image data
camera
reference image
coordinate system
coordinates
Prior art date
Application number
PCT/JP2018/023697
Other languages
French (fr)
Japanese (ja)
Inventor
Kiyoharu Aizawa (相澤 清晴)
Kazuya Iwami (石見 和也)
Original Assignee
The University of Tokyo (国立大学法人 東京大学)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The University of Tokyo
Publication of WO2018235923A1 publication Critical patent/WO2018235923A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras

Definitions

  • the present invention relates to a position estimation device, a position estimation method, and a program.
  • the method of estimating the absolute position of a video using geotagged images can achieve relatively high-accuracy position estimation, compared with methods using road networks or satellite imagery, provided that sufficient correspondences between images can be obtained.
  • on the other hand, associating the input traveling video with geotagged images is easily affected by changes in the illumination environment and in the captured angle of view, and can be difficult. If this association is not appropriate, position estimation either fails or cannot reach the required accuracy.
  • the present invention has been made in view of such circumstances, and one of its objects is to provide a position estimation device, a position estimation method, and a program that can set corresponding position information for each frame image of a traveling video, including frames captured at places that cannot be directly associated with a geotagged image.
  • Non-Patent Document 1 discloses estimating (reconstructing) three-dimensional information of objects captured in a traveling video by SLAM (Simultaneous Localization and Mapping) and performing position estimation of the traveling video by deforming the resulting three-dimensional reconstruction map as an object in the world coordinate system; however, it is known that with the method of Non-Patent Document 1 the error accumulates as the traveling distance becomes longer, and the position estimation accuracy decreases.
  • the present invention, which solves the problems of the prior art described above, is a position estimation device comprising: accepting means for accepting moving image data including a series of frames obtained by imaging subjects on a movement path while moving; acquisition means for taking at least one of the frames included in the moving image data as a target frame and acquiring reference image data in which at least one subject imaged in the target frame is also imaged and which is associated with position information, in a predetermined world coordinate system, representing the position at which it was captured in advance; reconstruction means for extracting feature points of the subjects imaged in each frame included in the moving image data and generating a reconstruction map in which the feature points are associated with coordinates in a reconstruction space, which is a virtual three-dimensional space; search means for searching the reference image data for reference feature points corresponding to the feature points; relationship acquisition means for acquiring, based on the found reference feature points, the position information associated with the reference image data, and the coordinates associated with the feature points, a conversion relationship between the world coordinate system, which is the coordinate system of the position information, and the coordinate system of the coordinates associated with the feature points; and conversion means for correcting the reconstruction map using the conversion relationship, wherein the conversion means corrects the reconstruction map by performing a correction that scales the amount of movement of the camera while suppressing changes in the estimated position and orientation, expressed in the coordinates of the reconstruction space, of the camera that captured the moving image data, and a correction that brings the position and orientation of the camera close to the position information associated with the reference image data.
  • according to the present invention, corresponding position information can be set even for frame images of a traveling video that include places that cannot be directly associated with a geotagged image.
  • the position estimation device 1 includes, as illustrated in FIG. 1, a control unit 11, a storage unit 12, an operation unit 13, a display unit 14, a communication unit 15, and an interface unit 16.
  • the control unit 11 is a program control device such as a CPU, and operates in accordance with a program stored in the storage unit 12.
  • the control unit 11 receives moving image data including a series of frames obtained by capturing an object on a moving path while moving.
  • taking at least one of the frames included in the received moving image data as a target frame, the control unit 11 acquires reference image data in which at least one subject imaged in the target frame is also imaged and which is associated with position information, in world coordinates, representing the position at which it was captured in advance.
  • the control unit 11 extracts feature points of the subjects captured in each frame included in the received moving image data and performs processing as SLAM (Simultaneous Localization and Mapping) that associates the feature points with coordinates in the reconstruction space, a predetermined virtual three-dimensional space, while also searching the reference image data for reference feature points corresponding to those feature points.
  • based on the found reference feature points, the position information associated with the reference image data, and the coordinates associated with the feature points (the coordinate system of these feature points, that is, the coordinate system of the reconstruction space, is hereinafter called the SLAM coordinate system), the control unit 11 acquires a conversion relationship between the world coordinate system, which is the coordinate system of the position information, and the SLAM coordinate system.
  • the control unit 11 then converts the coordinates in the SLAM coordinate system associated with the feature points imaged in each frame into the values of the coordinates in the world coordinate system, using this conversion relationship.
  • the world coordinate system is represented by a three-dimensional coordinate system (x, y, z), whose (x, z) plane corresponds to the Universal Transverse Mercator (UTM) coordinate system, an orthogonal plane coordinate system in meters, and whose y-axis corresponds to the altitude above the ground plane (also in meters). As is widely known, values in the UTM orthogonal coordinate system can be converted into latitude and longitude coordinate values, as in the sketch below.
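  • A minimal sketch of this conversion, assuming the third-party pyproj package and (as an illustrative assumption) UTM zone 30N for the area of interest; the mapping of (x, z) to (easting, northing) and the zone choice are not specified in the original text.
```python
# Hedged sketch: convert between the (x, z) plane of the world coordinate system
# (interpreted as UTM easting/northing in meters) and latitude/longitude.
# Assumes pyproj and UTM zone 30N (EPSG:32630), chosen here only for illustration.
from pyproj import Transformer

utm_to_lonlat = Transformer.from_crs("EPSG:32630", "EPSG:4326", always_xy=True)
lonlat_to_utm = Transformer.from_crs("EPSG:4326", "EPSG:32630", always_xy=True)

def world_xz_to_latlon(x, z):
    """Interpret (x, z) as UTM (easting, northing) in meters and return (lat, lon)."""
    lon, lat = utm_to_lonlat.transform(x, z)
    return lat, lon

def latlon_to_world_xz(lat, lon):
    """Return UTM (easting, northing) in meters to use as the (x, z) world coordinates."""
    x, z = lonlat_to_utm.transform(lon, lat)
    return x, z

if __name__ == "__main__":
    x, z = latlon_to_world_xz(36.7213, -4.4214)   # a point in Malaga (illustrative)
    print(x, z, world_xz_to_latlon(x, z))
```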
  • in the present embodiment, when acquiring the conversion relationship between the world coordinate system, which is the coordinate system of the position information, and the SLAM coordinate system, the control unit 11 performs a correction that scales the amount of movement of the camera that captured the moving image data while suppressing changes in the estimated position and orientation expressed in SLAM coordinates, and a correction that brings the position and orientation of the camera close to the position information associated with the reference image data; it then acquires the conversion relationship between the world coordinate system and the SLAM coordinate system using the coordinates of the feature points expressed in the corrected SLAM coordinate system and the world-coordinate values of the corresponding reference feature points.
  • the detailed operation of the control unit 11 will be described later.
  • the storage unit 12 is a memory device, a disk device, or the like, and holds a program executed by the control unit 11.
  • the storage unit 12 also operates as a work memory of the control unit 11.
  • the operation unit 13 is a keyboard, a mouse, or the like, receives an instruction operation of the user, and outputs the content of the instruction operation to the control unit 11.
  • the display unit 14 is a display or the like, and displays and outputs information in accordance with an instruction input from the control unit 11.
  • the communication unit 15 is a network interface or the like, and transmits information such as a request via the network in accordance with an instruction input from the control unit 11.
  • the communication unit 15 also outputs the information received via the network to the control unit 11.
  • the communication unit 15 is used, for example, when acquiring a reference image such as a geotag image from a server on the Internet.
  • the interface unit 16 is, for example, a USB interface or the like, and outputs moving image data input from a camera or the like to the control unit 11.
  • functionally, as illustrated in FIG. 2, the control unit 11 includes a receiving unit 21, a SLAM processing unit 22, a reference image data acquisition unit 23, a search unit 24, a conversion relationship acquisition unit 25, and a conversion processing unit 26.
  • the receiving unit 21 receives moving image data including a series of frames obtained by capturing an object on a moving path with the camera while moving the camera.
  • This camera may be a monocular camera (that is, a camera that does not acquire information in the depth direction), and therefore, it is assumed that frames included in captured moving image data do not include depth information.
  • the control unit 11 reconstructs a three-dimensional map by SLAM using the received moving image data and deforms the obtained three-dimensional reconstruction map into the world coordinate system.
  • in this way, position information is associated with all frames of the moving image data, including frames that capture locations that cannot be directly associated with an image, such as a geotagged image, carrying position information in the world coordinate system.
  • in addition, the mapping to the world coordinate system is performed while mitigating the scale drift problem, in which scale errors gradually accumulate during the reconstruction processing.
  • the SLAM processing unit 22 extracts feature points of the subjects captured in each frame included in the moving image data and associates the feature points with coordinates in the reconstructed three-dimensional space (SLAM processing). The SLAM processing unit 22 then issues unique feature point identification information for each feature point and stores, in the storage unit 12, the feature point identification information, information identifying the frames from which the feature point was extracted, and the three-dimensional coordinates of the feature point in association with one another. This processing is described, for example, in ORB-SLAM (R. Mur-Artal et al., “ORB-SLAM: a versatile and accurate monocular SLAM system,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015); since the specific processing is widely known, its description is omitted here.
  • hereinafter, the information in which the feature point identification information generated here is associated with coordinate values in the coordinate system of the virtual three-dimensional space (the SLAM coordinate system) is called the SLAM map.
  • the SLAM processing unit 22 also generates information Cfp-kf, in which the feature points extracted from each key frame (KF) are associated with their three-dimensional coordinate values in the SLAM coordinate system of the SLAM map, and stores this information Cfp-kf in the storage unit 12 (a minimal data-structure sketch follows).
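  • The following is a minimal, hypothetical sketch of how the SLAM map and the Cfp-kf correspondence described above could be held in memory; the class and field names are illustrative assumptions, not taken from the original disclosure.
```python
# Hedged sketch of the data structures described above (names are illustrative).
from dataclasses import dataclass, field

@dataclass
class SlamMapPoint:
    fp_id: int                                        # unique feature point identification information
    xyz_slam: tuple                                   # 3D coordinates in the SLAM coordinate system
    observed_in: list = field(default_factory=list)   # key frame ids the point was extracted from

@dataclass
class SlamMap:
    points: dict = field(default_factory=dict)        # fp_id -> SlamMapPoint

    def add_observation(self, fp_id, xyz_slam, keyframe_id):
        pt = self.points.setdefault(fp_id, SlamMapPoint(fp_id, xyz_slam))
        pt.observed_in.append(keyframe_id)

def build_cfp_kf(slam_map):
    """Cfp-kf (one possible reading): per key frame, the observed feature points
    together with their SLAM-coordinate values."""
    cfp_kf = {}
    for pt in slam_map.points.values():
        for kf in pt.observed_in:
            cfp_kf.setdefault(kf, []).append((pt.fp_id, pt.xyz_slam))
    return cfp_kf
```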
  • the reference image data acquisition unit 23 also selects at least one of the frames included in the received moving image data as a target frame. This selection may be performed manually, or the frames selected as key frames in the three-dimensional reconstruction processing of the SLAM processing unit 22 may be used as target frames as they are.
  • the reference image data acquisition unit 23 acquires reference image data in which at least one subject in common with the subjects captured in the selected target frame is imaged and which is associated with position information, in world coordinates, representing the position at which it was captured in advance.
  • moving image data to be received is captured by a camera mounted on a vehicle or the like moving along a road.
  • the reference image data can be searched, for example, from Google Street View (https://www.google.com/streetview/) of Google Inc. in the United States.
  • Google Street View is a searchable GIS (Geographic Information System) for the streets, and is one of the large geotag image datasets for countries around the world.
  • all geotag images are given as information in which a panoramic image and latitude and longitude information are associated.
  • each panoramic image is cut out in eight horizontal directions, at the same angle of view as the target frame selected from the received moving image data, and the results are used as a group of geotagged images (a minimal sketch of this cropping follows).
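  • A minimal sketch of the cropping step, assuming an equirectangular Street View panorama loaded with OpenCV; rendering a true perspective view at the target frame's angle of view would require a camera model, so the sketch simply cuts the panorama into eight equal horizontal sectors as an approximation (an assumption for illustration, not the patent's exact procedure).
```python
# Hedged sketch: cut an equirectangular panorama into 8 horizontal views.
# Simplified stand-in for rendering perspective crops at the target frame's angle of view.
import cv2
import numpy as np

def cut_panorama_8(panorama, fov_deg=45.0):
    h, w = panorama.shape[:2]
    crop_w = int(w * fov_deg / 360.0)
    views = []
    for k in range(8):
        center = int(w * k / 8.0)
        xs = np.arange(center - crop_w // 2, center + crop_w // 2) % w   # wrap around the seam
        views.append(panorama[:, xs])
    return views

if __name__ == "__main__":
    pano = cv2.imread("panorama.jpg")          # hypothetical input file
    if pano is not None:
        for i, v in enumerate(cut_panorama_8(pano)):
            cv2.imwrite(f"view_{i}.png", v)
```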
  • the reference image data does not have to be acquired from Google Street View.
  • when the route includes an intersection, image data captured separately from the moving image data and associated with the position information (latitude and longitude information) of the intersection can also be used as a geotagged image.
  • the moving route of the moving image data to be accepted may not necessarily be outdoors, and may be, for example, a route moving in a facility such as a store.
  • in that case, image data captured separately (apart from the moving image data) on the moving path, together with coordinate values in a world coordinate system appropriately defined for the facility, is used as the reference image data.
  • the world coordinate system in this case may be an orthogonal coordinate system in which, viewing the facility from above, a specific point in the facility is taken as the origin, the positive X axis points east, the positive Y axis points north, and the (X, Y) values are expressed in meters.
  • the reference image data acquisition unit 23 accepts, for example, input of information specifying the area in which the received moving image data was captured (for example, a city name, a town name, or a specified distance (for example, 400 meters) from a specified point), receives from the Google Street View server the geotagged images associated with latitude and longitude information within the input area, and cuts out each of those geotagged images in eight horizontal directions, at the same angle of view as the target frame selected from the received moving image data, to form a group of geotagged images.
  • the reference image data acquisition unit 23 then selects, as reference image data, the k geotagged images most similar to the selected target frame, in descending order of similarity.
  • for the degree of similarity, a method using a bag-of-words approach with SIFT feature quantities (P. Agarwal, W. Burgard, and L. Spinello, “Metric localization using Google Street View,” IROS, pp. 3111-3118, 2015) or the like may be adopted.
  • alternatively, the reference image data acquisition unit 23 may display and output the selected target frame together with, for example, geotagged images from Google Street View in the vicinity of the target frame, have the user select a geotagged image, and acquire the selected geotagged image as reference image data.
  • the search unit 24 acquires sets of corresponding feature points from the target frame selected by the reference image data acquisition unit 23 and the reference image data acquired by the reference image data acquisition unit 23. Specifically, the search unit 24 detects ORB feature points (E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to SIFT or SURF,” ICCV, pp. 2564-2571, 2011) from the target frame and from the reference image data, and then matches the feature points detected in each.
  • the search unit 24 may remove wrongly matched pairs by using a VLD (Virtual Line Descriptor) or the like.
  • details of the VLD are disclosed in Z. Liu and R. Marlet, “Virtual line descriptor and semi-local matching method for reliable feature correspondence,” BMVC, pp. 16-1, 2012, and are widely known, so a detailed explanation is omitted here; a minimal ORB matching sketch follows.
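  • A minimal sketch of the ORB detection and matching step using OpenCV; since the VLD-based outlier removal mentioned above is not available in OpenCV, a simple Lowe ratio test is used here as a stand-in (an assumption for illustration).
```python
# Hedged sketch: ORB feature detection and matching between a target frame and
# one reference image, with a ratio test standing in for VLD-based filtering.
import cv2

def match_orb(target_img, reference_img, ratio=0.75):
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(target_img, None)
    kp2, des2 = orb.detectAndCompute(reference_img, None)
    if des1 is None or des2 is None:
        return []
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    knn = matcher.knnMatch(des1, des2, k=2)
    good = [m[0] for m in knn if len(m) == 2 and m[0].distance < ratio * m[1].distance]
    # pairs of (target-frame pixel, reference-image pixel) for the surviving matches
    return [(kp1[m.queryIdx].pt, kp2[m.trainIdx].pt) for m in good]
```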
  • furthermore, the search unit 24 may obtain a plurality of reference image data corresponding to one target frame (as described above, for example when k images are selected in descending order of similarity).
  • when the number of matched feature points found does not reach a predetermined threshold (for example, 5), the search unit 24 excludes the target frame from selection (does not set it as a target frame).
  • the conversion relationship acquisition unit 25 estimates a conversion relationship CSLAM-World between the SLAM coordinate system and the world coordinate system. Specifically, the conversion relationship acquisition unit 25 first estimates the pose of the camera that captured the reference image data. That is, for each feature point of the SLAM map for which a corresponding reference feature point was found in the reference image data by the search unit 24, it obtains from the SLAM map the coordinate value of that feature point in the SLAM coordinate system. In addition, the conversion relationship acquisition unit 25 receives from the search unit 24 information on the position of the reference feature point in the reference image data corresponding to each feature point (a two-dimensional position within the reference image data), and obtains information Cmap-geo in which the SLAM-coordinate value of each feature point is associated with the position of the corresponding reference feature point in the reference image data.
  • the conversion relationship acquisition unit 25 obtains the pose information (six degrees of freedom) in the SLAM coordinate system of the camera that captured the reference image data by minimizing the reprojection error obtained when Cmap-geo is reprojected onto the reference image data. This minimization is performed, for example, using the Levenberg-Marquardt method.
  • the conversion relationship acquisition unit 25 then collects pairs of the pose, in the SLAM coordinate system, of the camera that captured each reference image data and the world coordinates associated with that reference image data, and obtains from these pairs the conversion relationship CSLAM-World between the SLAM coordinate system and the world coordinate system by a widely known method (one possible choice is sketched below).
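  • One widely known way to obtain such a relationship from paired positions is the closed-form estimation of a similarity transformation (Umeyama's method); the patent does not name a specific algorithm, so this choice and the function below are assumptions for illustration.
```python
# Hedged sketch: estimate a similarity transform (scale s, rotation R, translation t)
# mapping SLAM-coordinate camera positions to world-coordinate positions.
# Umeyama's method is used here as one possible "widely known method".
import numpy as np

def umeyama_similarity(src, dst):
    """src, dst: (N, 3) arrays of corresponding points. Returns s, R, t with dst ~ s*R@src + t."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                      # cross-covariance of the centered point sets
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:    # keep a proper rotation (det = +1)
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t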
  • the conversion processing unit 26 transforms the SLAM map using the conversion relationship CSLAM-World between the SLAM coordinate system and the world coordinate system.
  • specifically, the conversion processing unit 26 sequentially performs initialization processing, pose graph optimization, and bundle adjustment.
  • the initialization processing applies the following linear transformations in sequence to the entire SLAM map obtained up to that point, using the correspondence CSLAM-World between the SLAM coordinate system of the SLAM map and the world coordinate system, so as to roughly align the SLAM map with the world coordinate system.
  • this initialization processing is performed only at the beginning of the processing and at timings at which a predetermined condition is satisfied.
  • specifically, the initialization processing is performed at the timing at which the search unit 24 first obtains a pair of a target frame and reference image data whose number of matched feature points is equal to or greater than the predetermined threshold, and thereafter, for the i-th such pair (i being an integer with i ≥ 2), whenever the distance between the estimated position and the position information of the reference image data exceeds a predetermined threshold (for example, 10 m).
  • the conversion processing unit 26 does not perform pose graph optimization and bundle adjustment processing until the first initialization is performed.
  • as the first linear transformation of the initialization processing, the conversion processing unit 26 assumes that the camera capturing the moving image data moves on a single plane and rotates the SLAM map so that this plane coincides with the xz plane.
  • the plane on which the camera lies is estimated by principal component analysis of all the camera positions determined so far (a minimal sketch follows).
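  • A minimal sketch of this plane estimation by principal component analysis of the camera positions (numpy only). Taking the smallest-variance direction as the plane normal and rotating it onto the y-axis follows standard practice and is an assumption about the implementation.
```python
# Hedged sketch: estimate the plane of camera motion by PCA of camera positions,
# then build a rotation that maps the plane normal onto the y-axis.
import numpy as np

def plane_normal_from_positions(cam_positions):
    P = np.asarray(cam_positions, float)
    X = P - P.mean(axis=0)
    w, V = np.linalg.eigh(X.T @ X)       # ascending eigenvalues
    n = V[:, 0]                          # smallest-variance direction = plane normal
    return n / np.linalg.norm(n)

def rotation_normal_to_y(n):
    """Rotation matrix sending unit vector n to the y-axis (0, 1, 0)."""
    y = np.array([0.0, 1.0, 0.0])
    v = np.cross(n, y)
    c = float(np.dot(n, y))
    if np.linalg.norm(v) < 1e-12:                     # already (anti)parallel
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + vx + vx @ vx * (1.0 / (1.0 + c))
```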
  • as the second linear transformation, a similarity transformation is performed that brings each point p in the SLAM coordinate system involved in the first to i-th CSLAM-World closer to the corresponding point pworld in the world coordinate system (Equation (1)).
  • by this transformation, the poses of the cameras that captured the target frames and the reference image data, and the positions of the feature points in the SLAM map, are transformed.
  • the first and second linear transformations here are both three-dimensional similarity transformations, and they do not improve the scale drift.
  • the conversion processing unit 26 performs scale drift improvement processing by pose graph optimization each time position estimation is newly performed using the i-th reference image data.
  • the pose graph used here includes, as illustrated in FIG. 3, nodes Sn representing the pose information of the camera that captured each frame, nodes Sm representing the pose information of the camera that captured each reference image data, and nodes fp representing the position information associated with the reference image data.
  • adjacent nodes Sn (temporally adjacent in the moving image data), representing the pose information of the camera that captured the frames, are connected to each other by a first constraint e1 relating to pose. That is, the relative transformation between the camera poses of adjacent frames is constrained by e1.
  • the node Sn corresponding to a target frame and the node Sm representing the pose information of the camera that captured the reference image data containing feature points corresponding to the feature points included in that target frame are connected to each other by a second constraint e2. That is, the relative transformation between the camera poses of the target frame and the corresponding reference image data is constrained by e2.
  • furthermore, the node Sm representing the pose information of the camera that captured the reference image data and the node fp representing the position information associated with that reference image data are connected to each other by a third constraint e3 relating to distance.
  • in optimizing this pose graph, the conversion processing unit 26 corrects the information on the movement path of the camera that captured the frames while suppressing changes in the camera pose information at each node and changes in the moving direction of the camera, but allowing changes in scale. At this time, the conversion processing unit 26 also performs a correction that brings the positions of the reference image data associated with the target frames, which are a subset of the frames, close to the position information in the world coordinate system originally associated with that reference image data.
  • in other words, the scale drift is improved by constrained pose graph optimization in the three-dimensional similarity transformation group Sim(3).
  • in pose graph optimization, the camera poses are taken as the optimization variables, and the optimization is performed under constraints on the relative transformations between camera poses. That is, in the present embodiment, nonlinear deformation that takes scale drift into account is performed using the pose graph.
  • the three-dimensional rigid transformation G belonging to the special Euclidean group SE(3) is defined by Equation (2); in the standard form it is $G = \begin{pmatrix} R & t \\ \mathbf{0}^{\top} & 1 \end{pmatrix}$, where R is a rotation matrix, t is a three-dimensional translation vector with real components (a vector quantity printed in bold in the original, written as t here for convenience), and s is a nonnegative real scale value.
  • the conversion from SE(3) to Sim(3) is performed by setting the scale component s to 1 without changing the rotation matrix R or the translation vector t. That is, the three-dimensional similarity transformation S (with S belonging to Sim(3)) becomes $S = \begin{pmatrix} sR & t \\ \mathbf{0}^{\top} & 1 \end{pmatrix}$.
  • SO(3), SE(3), and Sim(3) are all Lie groups; each is mapped to the corresponding Lie algebra by the exponential map, and the inverse mapping, the logarithmic map, is also defined.
  • an element of a Lie algebra is written here in vector notation of its coefficients.
  • the Lie algebra corresponding to Sim(3) is written as a seven-dimensional (seven-degree-of-freedom) vector, obtained by adding a scale component to the six-dimensional (six-degree-of-freedom) vector representing the camera pose information, and its exponential map is defined accordingly (a hedged reconstruction is given below).
  • the superscript T on a vector or matrix denotes transposition (the same applies hereinafter).
  • W is a term analogous to the one appearing in Rodrigues' formula.
  • in the present embodiment, a cost function associated with the constrained deformation of the pose graph is defined and minimized by the Levenberg-Marquardt method on Lie groups; while maintaining the structure of the original SLAM map, the processing that improves the scale drift and the processing that brings corresponding points of the SLAM coordinate system and the world coordinate system close to each other are thereby performed at one time.
  • the corresponding logarithmic map yields a seven-dimensional real vector.
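  • The exponential and logarithmic maps referred to above are not reproduced in this text; the following LaTeX block is a hedged reconstruction of the standard Sim(3) form (following common usage, e.g. Strasdat et al. 2010), given for readability rather than as the patent's own equations.
```latex
% Hedged reconstruction of the standard Sim(3) exponential map (an assumption,
% following common usage; the patent's exact notation may differ).
\[
\xi = (\nu^{\top}, \omega^{\top}, \sigma)^{\top} \in \mathbb{R}^{7},
\qquad
\exp(\xi) =
\begin{pmatrix}
 e^{\sigma} \exp(\hat{\omega}) & W\nu \\
 \mathbf{0}^{\top} & 1
\end{pmatrix} \in \mathrm{Sim}(3),
\]
% where $\exp(\hat{\omega})$ is the rotation obtained from $\omega$ by Rodrigues' formula,
% $W$ is the matrix analogous to the one in Rodrigues' formula (depending on $\omega$ and
% $\sigma$), and the logarithmic map returns the seven-dimensional real vector $\xi = \log(S)$.
```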
  • the third constraint e3, relating to the distance between the node Sm representing the pose information of the camera that captured the reference image data and the node fp representing the position information ym associated with that reference image data, is defined using these quantities.
  • minimizing e1i,j and e2k,l acts to suppress changes in the relative transformations between camera poses, apart from gradual changes in scale. Minimizing e3m acts to bring the camera position of the reference image data close to the position information in the world coordinate system associated with that reference image data.
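  • The concrete residual expressions are not reproduced in this text; the following is a hedged reconstruction consistent with the description above and with common Sim(3) pose-graph formulations, and the patent's exact forms may differ.
```latex
% Hedged reconstruction of plausible residuals for the three edge types
% (consistent with the surrounding description; not the patent's verbatim equations).
\[
e_{1\,i,j} = \log_{\mathrm{Sim}(3)}\!\left( \Delta S_{i,j}\, S_j\, S_i^{-1} \right),
\qquad
e_{2\,k,l} = \log_{\mathrm{Sim}(3)}\!\left( \Delta S_{k,l}\, S_l\, S_k^{-1} \right),
\qquad
e_{3\,m} = \mathrm{trans}\!\left( S_m^{-1} \right) - y_m ,
\]
% where $\Delta S_{i,j}$ is the relative transformation measured before optimization,
% $\mathrm{trans}(\cdot)$ extracts the camera position in world coordinates, and
% $y_m$ is the geotag position associated with the $m$-th reference image data.
```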
  • to summarize, the pose graph used by the position estimation device 1 is as follows.
  • Node Sn: pose of the camera when capturing the n-th key frame; Sn ∈ Sim(3), n ∈ {1, 2, ..., N}.
  • Node Sm: pose of the camera when capturing the m-th reference image data; Sm ∈ Sim(3), m ∈ {1, 2, ..., M}.
  • Edge e1i,j: constraint due to the relative transformation between the camera poses when capturing the i-th and j-th key frames; (i, j) ∈ C1.
  • Edge e2k,l: constraint due to the relative transformation between the camera poses for the pair (k, l) ∈ C2 of a target frame and its corresponding reference image data.
  • Here, N is the total number of key frames, M is the number of target frames (the total number of reference image data associated with the target frames), C1 is the set of key-frame pairs in which the same feature points of the SLAM map are captured, and C2 is the set of pairs of a target frame and the corresponding reference image data.
  • the conversion processing unit 26 executes processing that optimizes the pose graph defined as above. That is, the conversion processing unit 26 extracts, from the frames of the received moving image data, at least one key frame group C1 (containing N key frames) in which common feature points are imaged. Referring also to the set C2 (M pairs of reference image data) of target frames and the corresponding feature points found by the search unit 24, it estimates the camera pose information S1, S2, ... by minimizing the cost function on the Lie manifold according to the Levenberg-Marquardt method.
  • the conversion processing unit 26 also reflects the transformation obtained by this optimization in the positions of the feature points of the SLAM map. This reflection can be carried out using widely known methods, such as the one used in H. Strasdat et al., “Scale drift-aware large scale monocular SLAM,” Robotics: Science and Systems VI, 2010.
  • that is, compared with the pose graph used by Strasdat et al., the nodes Sm representing the poses of the cameras that captured the reference image data, the edges e2k,l representing constraints due to the relative transformation between the camera poses of the target frames and the corresponding reference image data, and the edges e3m representing the distance between the pose Sm of the camera that captured the reference image data and the position information ym in the world coordinate system associated with it are added; the scale drift of the SLAM map is thereby improved by bringing the camera positions of the reference image data, such as geotagged images, close not to a loop closure but to the values in the world coordinate system represented by the position information (geotags).
  • when the frames of the received moving image data contain another key frame group C1 in which common feature points are imaged, other than the key frame group C1 already subjected to the pose graph optimization, the conversion processing unit 26 sequentially performs the pose graph optimization on that key frame group as well (a simplified sketch of the optimization follows).
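  • The following is a deliberately simplified, hedged sketch of the constrained optimization described above, using scipy. It optimizes only per-key-frame positions and per-key-frame log-scales (rather than full Sim(3) poses), which is enough to illustrate how relative-motion constraints, scale variables, and geotag anchors can be combined in a single least-squares problem; all names and weights are illustrative assumptions, not the patent's implementation.
```python
# Hedged, simplified sketch of constrained pose-graph optimization.
# Variables: key-frame positions p_n (3D) and per-key-frame log-scales s_n.
# Residuals: (i) keep the scaled relative motion close to the SLAM estimate,
# (ii) let the scale change only gradually, (iii) pull key frames that have a
# geotagged reference image toward the geotag position in world coordinates.
# (The patent optimizes full Sim(3) poses with Levenberg-Marquardt; scipy's
# default trust-region solver is used here for robustness of the sketch.)
import numpy as np
from scipy.optimize import least_squares

def optimize_path(p_slam, geo_anchors, w_rel=1.0, w_scale=10.0, w_geo=0.1):
    """p_slam: (N, 3) SLAM key-frame positions; geo_anchors: {frame index: (3,) world position}."""
    p_slam = np.asarray(p_slam, float)
    N = len(p_slam)
    rel = np.diff(p_slam, axis=0)                            # SLAM relative motions

    def residuals(x):
        p = x[:3 * N].reshape(N, 3)
        s = x[3 * N:]
        r = []
        for n in range(N - 1):
            scale = np.exp(0.5 * (s[n] + s[n + 1]))
            r.append(w_rel * ((p[n + 1] - p[n]) - scale * rel[n]))   # (i)
            r.append([w_scale * (s[n + 1] - s[n])])                  # (ii)
        for n, y in geo_anchors.items():
            r.append(w_geo * (p[n] - np.asarray(y, float)))          # (iii)
        return np.concatenate([np.ravel(v) for v in r])

    x0 = np.concatenate([p_slam.ravel(), np.zeros(N)])
    sol = least_squares(residuals, x0)
    return sol.x[:3 * N].reshape(N, 3), np.exp(sol.x[3 * N:])
```
  • With two or more geotag anchors along the trajectory, such an optimizer stretches or shrinks the path toward them while the relative-motion and scale-smoothness terms preserve its local shape, which is the behaviour the e1, e2, and e3 constraints above are intended to produce.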
  • the conversion processing unit 26 further deforms the SLAM map by bundle adjustment (BA) that includes constraints determined in relation to the reference image data.
  • specifically, the conversion processing unit 26 of the present embodiment performs the following bundle adjustment, minimizing the Cfp-kf reprojection error and the Cfp-geo reprojection error together.
  • in this bundle adjustment, the conversion processing unit 26 obtains the reprojection error ri,j between the i-th feature point and the j-th camera pose from the following quantities.
  • Xi is the coordinate of the i-th feature point in the SLAM coordinate system,
  • xi is the two-dimensional coordinate of that feature point within the frame,
  • Rj and tj represent the rotation and translation of the j-th camera pose,
  • (fx, fy) is the focal length, and
  • (cx, cy) represents the coordinates of the projection center.
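  • The equation itself is not reproduced in this text; the following is a hedged sketch of the standard pinhole reprojection error built from the quantities just listed (it may differ in detail from the patent's own equation).
```python
# Hedged sketch: standard pinhole reprojection error r_ij from the quantities above.
import numpy as np

def reprojection_error(X_i, x_i, R_j, t_j, fx, fy, cx, cy):
    """X_i: 3D point, x_i: observed 2D point, (R_j, t_j): camera pose; returns the 2D residual r_ij."""
    Xc = R_j @ np.asarray(X_i, float) + np.asarray(t_j, float)   # point in camera coordinates
    u = fx * Xc[0] / Xc[2] + cx                                  # projected pixel coordinates
    v = fy * Xc[1] / Xc[2] + cy
    return np.asarray(x_i, float) - np.array([u, v])
```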
  • the total cost function is defined as follows, including the constraints relating to the reference image data.
  • Tj is the pose information of the camera that captured the j-th key frame, expressed as an element of SE(3).
  • the robust cost applied to each term is a Huber robust cost function.
  • C5 represents the feature points in the key frames of C1.
  • C1 and C3 are the sets already mentioned.
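  • The total cost function is likewise given in the original as an equation not reproduced here; the following is a hedged reconstruction consistent with the surrounding description (Huber-robustified reprojection terms over the key-frame correspondences plus reprojection terms over the reference-image correspondences), and the exact index sets and weights in the patent may differ.
```latex
% Hedged reconstruction of the total bundle-adjustment cost (not the patent's verbatim equation).
\[
E = \sum_{(i,j)} \rho_{h}\!\left( \left\lVert r_{i,j} \right\rVert^{2} \right)
  + \sum_{(i,m)} \rho_{h}\!\left( \left\lVert \tilde{r}_{i,m} \right\rVert^{2} \right),
\]
% where the first sum runs over the feature-point / key-frame correspondences (Cfp-kf),
% the second over the feature-point / reference-image correspondences (Cfp-geo),
% $\tilde{r}_{i,m}$ is the reprojection error of feature point $i$ onto reference image $m$,
% and $\rho_{h}$ is the Huber robust cost function.
```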
  • the feature points of the key frame and the information on the camera pose can be obtained by minimizing the total cost function on this Lie manifold. This minimization can be done using the Levenberg-Marquardt method.
  • that is, in order to incorporate the constraints from the position information associated with the reference image data, not only the usual Cfp-kf reprojection error but also the Cfp-geo reprojection error is minimized.
  • in this minimization, the pose information of the cameras that captured the reference image data is kept fixed. This deformation can further reduce the scale drift of the SLAM map, particularly when a sufficiently good initial solution is provided.
  • the position estimation device 1 of the present embodiment has the above configuration and operates as follows.
  • the position estimation device 1 receives moving image data including a series of frames obtained by imaging subjects on a moving route while moving, and executes the processing illustrated in FIG. 4 below.
  • the position estimation device 1 of the present embodiment extracts feature points of the subjects captured in each frame included in the moving image data and generates the SLAM map, a reconstruction map in which the feature points are associated with coordinates in the reconstruction space, a predetermined virtual three-dimensional space (S1: SLAM processing).
  • the position estimation device 1 selects at least one of the frames included in the received moving image data as a target frame (S2), and repeats the following processing for each target frame.
  • the position estimation device 1 takes the selected target frames one by one, in a predetermined order, as the frame to be processed.
  • that is, the position estimation device 1 acquires reference image data in which at least one subject imaged in the target frame is imaged and which is associated with position information, in a predetermined world coordinate system, representing the position at which it was captured in advance (S3).
  • this acquisition processing may be performed by instructing a server on the network to perform a search, or by asking the user to input the reference image data.
  • the position estimation device 1 searches for the reference feature point corresponding to the feature point found from the target frame in the acquired reference image data (feature point matching: S4).
  • the position estimation device 1 checks whether the number of reference feature points found here is equal to or greater than a predetermined threshold (for example, five) (S5).
  • if the number of reference feature points retrieved in step S5 is equal to or greater than the predetermined threshold (S5: Yes), the position estimation device 1 estimates the position and attitude of the camera that captured the reference image data acquired in step S3 (S6).
  • this processing can be performed using the coordinates (two-dimensional) of the feature points in the target frame, the coordinates (three-dimensional) of those feature points in the SLAM map coordinate system, the coordinates (two-dimensional) of the corresponding reference feature points in the reference image data, and the position information (three-dimensional coordinate values in the world coordinate system) associated with the reference image data (a minimal sketch follows).
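  • A minimal sketch of estimating the reference camera's pose from such 3D-2D correspondences using OpenCV's PnP solver; the patent describes a Levenberg-Marquardt reprojection-error minimization, so solvePnPRansac is used here only as a widely available stand-in, and the intrinsics are illustrative parameters.
```python
# Hedged sketch: estimate the pose of the camera that captured the reference image
# from 3D SLAM-map points and their 2D reference feature points (PnP).
import numpy as np
import cv2

def estimate_reference_camera_pose(points_3d_slam, points_2d_ref, fx, fy, cx, cy):
    """points_3d_slam: (N, 3) SLAM-map coordinates; points_2d_ref: (N, 2) pixels in the reference image."""
    K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]], dtype=np.float64)
    obj = np.asarray(points_3d_slam, dtype=np.float64).reshape(-1, 1, 3)
    img = np.asarray(points_2d_ref, dtype=np.float64).reshape(-1, 1, 2)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(obj, img, K, None)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)           # rotation of the reference camera in SLAM coordinates
    return R, tvec.reshape(3)
```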
  • the position estimation device 1 then returns to step S3, sets the next target frame as the frame to be processed, and continues the processing; if there is no target frame left unprocessed (if the processing for all target frames is completed), it proceeds to the next processing (processing S10 described below).
  • if the number of retrieved reference feature points is below the threshold (S5: No), the process returns to step S3 without executing step S6, and the next target frame is set as the frame to be processed and the processing continues. Here too, if there is no target frame left unprocessed (if the processing for all target frames is completed), the process proceeds to the next processing (processing S10 described below).
  • by the above processing, the position estimation device 1 can obtain a conversion relationship between the world coordinate system, which is the coordinate system of the position information associated with each reference image data, and the coordinate system of the coordinate values of the feature points found in each target frame, that is, the coordinate system of the SLAM map.
  • the position estimation device 1 performs transformation processing of the SLAM map.
  • the position estimation device 1 first determines whether to perform an initialization process (S10).
  • in step S10, it may be determined that the initialization processing is to be performed when the distance between the position previously estimated as the position of the camera (this estimation will be described later) and the position information associated with the corresponding reference image data exceeds a predetermined threshold (for example, 10 m).
  • when performing the initialization, the position estimation device 1 executes the initialization processing (S11): first, assuming that the camera capturing the moving image data moves on a single plane, the position estimation device 1 performs principal component analysis of the positions of the camera (the camera that captured the received moving image data) in the SLAM map obtained so far, determines the coordinate axes of the plane on which the camera is assumed to lie (for example, the axis of the normal to that plane), and rotates the SLAM map so that this normal becomes parallel to the y-axis (so that the plane on which the camera lies becomes the xz plane).
  • the position estimation device 1 then scales the SLAM coordinate system (a similarity transformation), using the frames and reference image data obtained so far and the position information (in the world coordinate system) associated with the reference image data, so that the sum (or the sum of squares) of the absolute differences between the coordinate value of each feature point in the SLAM coordinate system and the value of the corresponding reference feature point in the world coordinate system becomes minimal.
  • Next, the position estimation device 1 optimizes the constrained pose graph in Sim(3) (S12), defined as follows.
  • Node Sn: pose of the camera when capturing the n-th key frame; Sn ∈ Sim(3), n ∈ {1, 2, ..., N}.
  • Node Sm: pose of the camera when capturing the m-th reference image data; Sm ∈ Sim(3), m ∈ {1, 2, ..., M}.
  • Edge e1i,j: constraint due to the relative transformation between the camera poses when capturing the i-th and j-th key frames; (i, j) ∈ C1.
  • Edge e2k,l: constraint due to the relative transformation between the camera poses for the pair (k, l) ∈ C2 of a target frame and its corresponding reference image data.
  • Edge e3m: constraint on the distance between the pose Sm of the camera that captured the reference image data and the position information ym in the world coordinate system associated with it; m ∈ {1, 2, ..., M}.
  • that is, an optimization is performed using a cost function that includes a cost value relating to changes in the estimated position and orientation, expressed in SLAM coordinates of the reconstruction space (the SLAM map), of the camera that captured the moving image data, a cost value relating to the scaling of the movement amount of the camera, and a cost value relating to the distance between the position of the camera and the position information associated with the reference image data; the correction that scales the movement amount of the camera while suppressing changes in the estimated position and orientation expressed in SLAM coordinates, and the correction that brings the position and orientation of the camera close to the position information associated with the corresponding reference image data, are thereby performed at one time.
  • the position estimation device 1 also performs bundle adjustment processing to correct the reprojection error of the feature points onto the key frames and the reprojection error of the reference feature points onto the reference image data (S13). As a result, an estimate of the position and orientation, expressed in the world coordinate system, of the camera that captured the moving image data is obtained for each target frame.
  • the position estimation device 1 outputs the information (result of estimation) of the position and orientation of the camera in the world coordinate system obtained in the processing up to this point (S14), and ends the processing.
  • the position estimation device 1 has been described as an example of receiving moving image data whose imaging has already been completed, but the present embodiment is not limited to this.
  • moving image data including a series of frames obtained by imaging subjects on a moving path while moving may instead be accepted sequentially, each time a part of the frames (for example, one frame) is captured.
  • in that case, the SLAM map is generated by sequential processing (that is, feature points of the subjects captured in the sequentially received frames are extracted, and those feature points are associated with coordinates in the reconstruction space, the predetermined virtual three-dimensional space).
  • Such SLAM map generation processing is widely known as incremental SfM (Structure from Motion) or the like, and thus detailed description thereof is omitted here.
  • in this case, the position estimation device 1 also performs the subsequent processing sequentially. That is, the position estimation device 1 determines whether the received frame is to be used as a key frame and, when using it as a key frame, selects that key frame as a target frame and performs the processing from step S3 onward in FIG. 4.
  • when the difference between the position of the camera estimated for a certain frame in steps S12 and S13 of FIG. 4 and the position represented by the position information associated with the reference image data corresponding to that frame exceeds a predetermined threshold, the process may return to step S10 and perform the initialization processing again.
  • in the following example, a position estimation method different from the embodiment of the present invention, using reference image data (here, geotagged images), and the position estimation method according to the embodiment of the present invention are compared, in order to quantitatively evaluate the accuracy of the method of the embodiment in which the position estimation device 1 according to the embodiment of the present invention described above is implemented by a computer.
  • for the three-stage deformation processing consisting of (1) the linear transformation of the initialization processing (ILT), (2) the constrained pose graph optimization on Sim(3) (PGO), and (3) the bundle adjustment (BA) that corrects the reprojection error onto the key frames and the reprojection error onto the reference image data, the influence of each process on accuracy and its usefulness are also examined.
  • for the evaluation, the Málaga Stereo and Laser Urban Data Set (Málaga data set) is used.
  • the video of this Málaga data set has a resolution of 1024 × 768 and a frame rate of 20 fps.
  • two types of video (Video 1 and 2) are cut out from the video and used for evaluation.
  • the paths covered by the two videos do not form loops, and each path length is 1 km or more.
  • although all the frames are associated with GPS position information acquired once per second, frames containing errors of 10 m or more are also included.
  • for the comparison methods, SLAM is used as in the present embodiment, the portion that acquires the correspondence between the world coordinate system and the SLAM coordinate system (the reconstructed three-dimensional coordinate system) is replaced with the same processing as in the present embodiment, and only the processing that deforms the SLAM map is compared.
  • the Kroeger method uses spline smoothing to smooth the camera pose.
  • in that method, the camera pose of the geotagged image (reference image data) corresponding to each key frame is smoothed (interpolated) using a cubic B-spline.
  • Chang's method performs a two-dimensional affine transformation on the xz plane so that the SLAM map restored from the input moving image data matches the 3D point cloud acquired from Google Street View, and performs a scale conversion in the y-axis direction.
  • here, a method that applies the same transformation as Chang's method (Affine+) to the SLAM map was used.
  • the method of the present embodiment is closer to the ground truth (GT) than the other methods, and achieves a significant improvement in accuracy compared with them.
  • because the Kroeger method interpolates the correspondence between sparse geotags and images without considering the three-dimensional structure, large errors occur in the position estimates where there are not enough corresponding points.
  • Chang's method uses the three-dimensional structure of the video but applies only a simple linear transformation, so it is strongly affected by distortion due to scale drift, and in some cases the position cannot be estimated with sufficient accuracy.
  • in the present embodiment, using the result of three-dimensional reconstruction (SLAM), the absolute position (the position in the world coordinate system, that is, information corresponding to latitude and longitude) of the camera at the time each frame of the moving image data was captured was estimated from the sparse correspondence between geotagged images and the captured moving image data.
  • that is, by integrating processing that improves scale drift using the sparse correspondence (association of only some of the frames) between geotagged images and the captured moving image data, estimation of the absolute position has become possible while properly using the three-dimensionally reconstructed structure.
  • the accuracy is considered to be improved as the error included in the position information such as the latitude and longitude associated with the reference image data is smaller.
  • if the feature points in the frames of the captured moving image data and the reference feature points in the corresponding reference image data lie far from the positions at which the images were captured, or if only a small number of them exist, the matching accuracy between feature points will not be sufficient.
  • furthermore, when the camera points in a direction parallel to the roadway (the movement trajectory), or when the angle of view of the camera is narrow, the error in the direction from the camera toward the feature point group becomes relatively large when estimating the position and orientation of the camera that captured the reference image data. This error mainly occurs in the direction parallel to the roadway.
  • Reference Signs List: 1 position estimation device, 11 control unit, 12 storage unit, 13 operation unit, 14 display unit, 15 communication unit, 16 interface unit, 21 receiving unit, 22 SLAM processing unit, 23 reference image data acquisition unit, 24 search unit, 25 conversion relationship acquisition unit, 26 conversion processing unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

This position estimating device accepts moving image data including a series of frames obtained by capturing images of a subject on a movement pathway while moving, extracts feature points relating to the subject captured in each frame included in the moving image data, and creates a reconfiguration map in which the feature points are associated with coordinates in a reconfiguration space, which is a prescribed virtual three-dimensional space. The position estimating device retrieves reference feature points corresponding to the feature points in reference image data associated with position information, acquires a transformation relationship between a world coordinate system, which is the coordinate system of the position information, and the coordinate system of the coordinates associated with the feature points, and uses the transformation relationship to correct the reconfiguration map by performing a correction that scales the amount of movement of the camera that captured the moving image data while suppressing changes in the estimated position and attitude of the camera expressed in the coordinates of the reconfiguration space, and a correction that brings the position and attitude of the camera close to the position information associated with the reference image data.

Description

Position estimation device, position estimation method, and program
The present invention relates to a position estimation device, a position estimation method, and a program.
In recent years, with the development of automatic driving technology for automobiles, infrastructure for collecting moving image data (traveling videos) captured while traveling on roads in various countries is being developed, based on various devices equipped with cameras.
At present, such a large number of traveling videos can only be viewed as simple videos. However, if meaningful information is extracted from these traveling videos and the images contained in each traveling video can be associated with the accurate positions on a map at which they were captured, an advanced geographic information system (GIS) can be constructed, which is expected to contribute to the development of automatic driving technology and robotics.
For this reason, performing traveling position estimation on traveling videos in a way that is applicable to a large number of traveling videos has been desired in recent years.
The method of estimating the absolute position of a video using geotagged images has the advantage that relatively high-accuracy position estimation is possible, compared with methods using road networks or satellite imagery, provided that sufficient correspondences between images can be obtained. On the other hand, associating the input traveling video with geotagged images is easily affected by changes in the illumination environment and in the captured angle of view, and can be difficult. If this association is not appropriate, position estimation either fails or cannot reach the required accuracy.
The present invention has been made in view of such circumstances, and one of its objects is to provide a position estimation device, a position estimation method, and a program that can set corresponding position information for each frame image of a traveling video, including frames captured at places that cannot be directly associated with a geotagged image.
Non-Patent Document 1 discloses estimating (reconstructing) three-dimensional information of objects captured in a traveling video by SLAM (Simultaneous Localization and Mapping) and performing position estimation of the traveling video by deforming the resulting three-dimensional reconstruction map as an object in the world coordinate system. However, it is known that with the method of Non-Patent Document 1 the error accumulates as the traveling distance becomes longer, and the position estimation accuracy decreases.
The present invention, which solves the problems of the prior art described above, is a position estimation device comprising: accepting means for accepting moving image data including a series of frames obtained by imaging subjects on a movement path while moving; acquisition means for taking at least one of the frames included in the moving image data as a target frame and acquiring reference image data in which at least one subject imaged in the target frame is also imaged and which is associated with position information, in a predetermined world coordinate system, representing the position at which it was captured in advance; reconstruction means for extracting feature points of the subjects imaged in each frame included in the moving image data and generating a reconstruction map in which the feature points are associated with coordinates in a reconstruction space, which is a virtual three-dimensional space; search means for searching the reference image data for reference feature points corresponding to the feature points; relationship acquisition means for acquiring, based on the found reference feature points, the position information associated with the reference image data, and the coordinates associated with the feature points, a conversion relationship between the world coordinate system, which is the coordinate system of the position information, and the coordinate system of the coordinates associated with the feature points; and conversion means for correcting the reconstruction map using the conversion relationship, wherein the conversion means corrects the reconstruction map by performing a correction that scales the amount of movement of the camera while suppressing changes in the estimated position and orientation, expressed in the coordinates of the reconstruction space, of the camera that captured the moving image data, and a correction that brings the position and orientation of the camera close to the position information associated with the reference image data.
According to the present invention, corresponding position information can be set even for frame images of a traveling video that include places that cannot be directly associated with a geotagged image.
FIG. 1 is a block diagram showing a configuration example of a position estimation device according to an embodiment of the present invention.
FIG. 2 is a functional block diagram showing an example of the position estimation device according to the embodiment of the present invention.
FIG. 3 is an explanatory diagram showing an example of a pose graph used by the position estimation device according to the embodiment of the present invention.
FIG. 4 is a flowchart showing an operation example of the position estimation device according to the embodiment of the present invention.
FIG. 5 is an explanatory diagram showing a comparison between an example of the present invention and a conventional example.
FIG. 6 is an explanatory diagram showing an effect of the example of the present invention.
Embodiments of the present invention will be described with reference to the drawings. The position estimation device 1 according to the embodiment of the present invention includes, as illustrated in FIG. 1, a control unit 11, a storage unit 12, an operation unit 13, a display unit 14, a communication unit 15, and an interface unit 16.
 The control unit 11 is a program-controlled device such as a CPU and operates in accordance with a program stored in the storage unit 12. In the present embodiment, the control unit 11 accepts moving image data including a series of frames obtained by imaging subjects along a movement path while moving. Taking at least one of the frames included in the accepted moving image data as a target frame, the control unit 11 acquires reference image data in which at least one subject captured in the target frame also appears and which is associated with position information, in world coordinates, representing the position at which it was captured in advance.
 The control unit 11 also extracts feature points of the subjects captured in each frame of the accepted moving image data and performs processing as SLAM (Simultaneous Localization and Mapping), which associates the feature points with coordinates in a reconstruction space, a predetermined virtual three-dimensional space, and searches the reference image data for reference feature points corresponding to the feature points.
 Based on the found reference feature points, the position information associated with the reference image data, and the coordinates associated with the feature points (the coordinate system of these feature points, that is, the coordinate system of the reconstruction space, is hereinafter called the SLAM coordinate system), the control unit 11 acquires a transformation relation between the world coordinate system, which is the coordinate system of the position information, and the SLAM coordinate system.
 Using this transformation relation, the control unit 11 then converts the coordinates, in the SLAM coordinate system, associated with the feature points captured in each frame into coordinate values in the world coordinate system. Here the world coordinate system is a three-dimensional coordinate system (x, y, z) whose (x, z) plane corresponds to the Universal Transverse Mercator (UTM) rectangular coordinate system, a planar rectangular coordinate system in meters, and whose y axis corresponds to the altitude above the ground plane (also in meters). As is widely known, values in the UTM rectangular coordinate system can be converted into latitude-longitude coordinate values.
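 A minimal sketch of this world-coordinate convention is given below, assuming the pyproj library and UTM zone 54N (EPSG:32654, which covers the Tokyo area); the zone choice and function names are illustrative assumptions, not part of the embodiment.

```python
# World coordinates: (x, z) are UTM easting/northing in metres, y is altitude in metres.
from pyproj import Transformer

# UTM zone 54N (metres) -> WGS84 latitude/longitude
utm_to_latlon = Transformer.from_crs("EPSG:32654", "EPSG:4326", always_xy=True)

def world_to_latlon(x: float, y: float, z: float) -> tuple[float, float, float]:
    """Convert a world-coordinate point into (latitude, longitude, altitude)."""
    lon, lat = utm_to_latlon.transform(x, z)
    return lat, lon, y
```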
 In the present embodiment, when acquiring the transformation relation between the world coordinate system, which is the coordinate system of the position information, and the SLAM coordinate system, the control unit 11 performs a correction that scales the amount of camera movement while suppressing changes in the estimated position and orientation, expressed in SLAM coordinates, of the camera that captured the moving image data, and a correction that brings the position and orientation of the camera close to the position information associated with the reference image data; it then acquires the transformation relation between the world coordinate system and the SLAM coordinate system using the corrected coordinates of the feature points in the SLAM coordinate system and the world-coordinate values of the corresponding reference feature points. The detailed operation of the control unit 11 will be described later.
 The storage unit 12 is a memory device, a disk device, or the like, and holds the program executed by the control unit 11. The storage unit 12 also operates as a work memory for the control unit 11.
 The operation unit 13 is a keyboard, a mouse, or the like; it accepts instruction operations from the user and outputs the contents of those operations to the control unit 11. The display unit 14 is a display or the like and displays information in accordance with instructions input from the control unit 11.
 The communication unit 15 is a network interface or the like; it transmits information such as requests via a network in accordance with instructions input from the control unit 11, and outputs information received via the network to the control unit 11. The communication unit 15 is used, for example, when acquiring reference images such as geotagged images from a server on the Internet. The interface unit 16 is, for example, a USB interface or the like, and outputs moving image data input from a camera or the like to the control unit 11.
 Functionally, as illustrated in FIG. 2, the control unit 11 of the present embodiment includes an accepting unit 21, a SLAM processing unit 22, a reference image data acquisition unit 23, a search unit 24, a transformation relation acquisition unit 25, and a transformation processing unit 26.
 The accepting unit 21 accepts moving image data including a series of frames obtained by imaging subjects along a movement path with a camera while moving the camera. This camera may be a monocular camera (that is, a camera that does not acquire depth information), so the frames included in the captured moving image data are assumed to contain no depth information.
 In the present embodiment, the control unit 11 reconstructs a three-dimensional map by SLAM using the accepted moving image data and deforms the obtained three-dimensional reconstruction map into the world coordinate system. In this way, position information is associated with every frame of the moving image data, including frames that capture places that cannot be directly associated with an image, such as a geotagged image, linked to position information in the world coordinate system.
 In the processing of reconstructing a three-dimensional map from moving image data in the present embodiment, the map is mapped to the world coordinate system while improving scale drift, taking into account the problem that scale errors gradually accumulate during reconstruction (the scale drift problem).
 The SLAM processing unit 22 executes processing (SLAM processing) that extracts feature points of the subjects captured in each frame of the moving image data and associates the feature points with coordinates in the reconstructed three-dimensional space. The SLAM processing unit 22 then issues unique feature point identification information for each feature point and stores, in the storage unit 12, the feature point identification information, information identifying the frame from which the feature point was extracted, and the coordinates of the feature point in the three-dimensional space, in association with one another. This processing can be performed using, for example, ORB-SLAM (R. Mur-Artal, et al., "ORB-SLAM: a versatile and accurate monocular SLAM system", IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147-1163, 2015); since the specific content of this processing is widely known, a description is omitted here.
 Hereinafter, the information generated here, which associates the feature point identification information with coordinate values in the coordinate system of the virtual three-dimensional space (the SLAM coordinate system), is called the SLAM map. In this SLAM processing, key frames (KF) are selected from among all the frames, and information Cfp-kf is obtained that associates, for each feature point in a key frame, the coordinates representing its position within the key frame (two-dimensional coordinates representing pixel positions in the frame image; this coordinate system is hereinafter called the frame coordinate system) with its three-dimensional coordinate values in the SLAM coordinate system of the SLAM map. The SLAM processing unit 22 stores this information Cfp-kf in the storage unit 12.
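 A minimal structural sketch of the data kept for the SLAM map and for the key-frame correspondences Cfp-kf described above is shown below; the class and field names are illustrative assumptions, not the patented implementation.

```python
from dataclasses import dataclass, field

@dataclass
class MapPoint:
    point_id: int                          # unique feature point identification information
    xyz_slam: tuple[float, float, float]   # 3-D position in the SLAM coordinate system

@dataclass
class KeyFrame:
    frame_id: int
    # Cfp-kf: feature point id -> 2-D pixel position in this key frame (frame coordinate system)
    observations: dict[int, tuple[float, float]] = field(default_factory=dict)

@dataclass
class SlamMap:
    points: dict[int, MapPoint] = field(default_factory=dict)
    keyframes: list[KeyFrame] = field(default_factory=list)
```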
 The reference image data acquisition unit 23 also selects at least one of the frames included in the accepted moving image data as a target frame. This selection may be made manually, or the frames selected as key frames by the SLAM processing unit 22 during the three-dimensional reconstruction processing may be used as target frames as they are.
 The reference image data acquisition unit 23 acquires reference image data in which at least one subject in common with the subjects captured in the selected target frame appears and which is associated with position information, in world coordinates, representing the position at which it was captured in advance.
 In one example of the present embodiment, the accepted moving image data is captured by a camera mounted on a vehicle or the like moving along a road. In this case, the reference image data can be retrieved, for example, from Google Street View (https://www.google.com/streetview/) of Google Inc. Google Street View is a searchable GIS (geographic information system) covering roads and is one of the large-scale geotagged image data sets covering countries around the world. In Google Street View, every geotagged image is given as a panoramic image associated with latitude-longitude information. In the present embodiment, each panoramic image is cropped in eight horizontal directions at the same angle of view as the target frames selected from the accepted moving image data, and the results are used as a group of geotagged images.
 However, in the present embodiment, the reference image data does not necessarily have to be acquired from Google Street View. For example, when the route includes an intersection, image data captured at the intersection (image data other than image data extracted from the moving image data) together with the position information (latitude-longitude information) of the intersection can be used as a geotagged image.
 Further, in the present embodiment, the movement path of the accepted moving image data does not necessarily have to be outdoors; it may be, for example, a path through a facility such as a store. In this case, image data captured separately (apart from the moving image data) along the movement path and coordinate values in a world coordinate system set appropriately within the facility are used as the reference image data. As an example, the world-coordinate values in this case may be values of an (X, Y) rectangular coordinate system in meters, taking a specific point in the facility as the origin in a plan view of the facility, with the positive X axis pointing east and the positive Y axis pointing north from the origin.
 Also, as described above, since world coordinates are assigned to the geotagged images of Google Street View, if the camera position of a geotagged image can be estimated in the SLAM coordinate system, the correspondence between the SLAM coordinates and the world coordinates can be acquired.
 For example, upon receiving input of information specifying the area in which the accepted moving image data was captured (for example, a city name, a town name, or a region within a specified distance (for example, 400 meters) from a specified point), the reference image data acquisition unit 23 acquires, from the Google Street View server, the geotagged images associated with latitude-longitude information within the input area, and crops each geotagged image in eight horizontal directions at the same angle of view as the target frames selected from the accepted moving image data to form a group of geotagged images.
 The reference image data acquisition unit 23 selects, as reference image data, the k geotagged images most similar to the selected target frame, in descending order of similarity. Here, the similarity may be computed by a method using a bag-of-words approach with SIFT features (Agarwal, W. Burgard, and L. Spinello, "Metric localization using google street view," IROS, pp. 3111-3118, 2015) or the like.
 In another example, the reference image data acquisition unit 23 may display the selected target frame, have the user select a geotagged image close to the displayed target frame, for example from the geotagged images provided by Google Street View, and acquire the selected geotagged image as the reference image data.
 The search unit 24 acquires sets of corresponding feature points from the target frame selected by the reference image data acquisition unit 23 and the reference image data acquired by the reference image data acquisition unit 23. Specifically, the search unit 24 detects ORB feature points (see E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," ICCV, pp. 2564-2571, 2011) from each of the target frame and the reference image data, and then matches the feature points detected from the two. A widely known method may be adopted for this matching, but it is preferable that the search unit 24 remove incorrect matches, for example by using a VLD (Virtual Line Descriptor). VLD is disclosed in detail in Z. Liu and R. Marlet, "Virtual line descriptor and semi-local matching method for reliable feature correspondence," BMVC, pp. 16-1, 2012, and is widely known, so a detailed description is omitted here.
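 A minimal sketch of this ORB matching step, using OpenCV, is shown below. It is an illustrative stand-in for the search unit; the VLD-based filtering mentioned in the text is replaced here, as an assumption, by a simple ratio test.

```python
import cv2

def match_orb_features(target_frame, reference_image, max_features=2000, ratio=0.75):
    """Return pairs of matched keypoint coordinates ((x, y) in frame, (x, y) in reference)."""
    orb = cv2.ORB_create(nfeatures=max_features)
    kp1, des1 = orb.detectAndCompute(target_frame, None)
    kp2, des2 = orb.detectAndCompute(reference_image, None)
    if des1 is None or des2 is None:
        return []
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    good = []
    for pair in matcher.knnMatch(des1, des2, k=2):
        # Keep a match only if it is clearly better than the second-best candidate.
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    return [(kp1[m.queryIdx].pt, kp2[m.trainIdx].pt) for m in good]
```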
 In the present embodiment, when a plurality of reference image data have been obtained for one target frame (for example, when the top k images are selected in descending order of similarity as described above), the search unit 24 further discards the information relating to any reference image data for which the number of mutually corresponding feature points is less than a predetermined threshold (for example, 5).
 At this time, if none of the reference image data obtained for the target frame has more than the threshold number of feature points corresponding to the target frame (hereinafter, when a distinction is necessary, feature points found in the reference image data are called reference feature points), the search unit 24 excludes that target frame from selection (does not treat it as a target frame).
 The transformation relation acquisition unit 25 estimates the transformation relation CSLAM-World between the SLAM coordinate system and the world coordinate system. Specifically, the transformation relation acquisition unit 25 first estimates the pose of the camera that captured the reference image data. That is, for each feature point in the SLAM map for which the search unit 24 found a corresponding reference feature point in the reference image data, the transformation relation acquisition unit 25 obtains the coordinate value of that feature point of the SLAM map in the SLAM coordinate system. The transformation relation acquisition unit 25 also obtains the positions, found by the search unit 24, of the reference feature points in the reference image data (two-dimensional positions in the reference image data) corresponding to each feature point, and thereby obtains information Cmap-geo that associates the SLAM-coordinate value of each feature point with the position of the corresponding reference feature point in the reference image data.
 The transformation relation acquisition unit 25 then obtains the pose information (six degrees of freedom) of the camera that captured the reference image data in the SLAM coordinate system by minimizing the reprojection error obtained when Cmap-geo is reprojected into the reference image data. This minimization is performed, for example, using the Levenberg-Marquardt method.
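 A minimal sketch of estimating the reference camera's pose in the SLAM coordinate system from the 3-D/2-D correspondences Cmap-geo is given below. As an assumption, OpenCV's solvePnPRansac (OpenCV 4.1 or later) is used as a stand-in for the reprojection-error minimization described above, and the intrinsic matrix K is taken as given.

```python
import numpy as np
import cv2

def estimate_reference_camera_pose(points_slam_3d, points_ref_2d, K):
    """points_slam_3d: (N, 3) SLAM coordinates; points_ref_2d: (N, 2) pixel positions."""
    obj = np.asarray(points_slam_3d, dtype=np.float64)
    img = np.asarray(points_ref_2d, dtype=np.float64)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(obj, img, K, None)
    if not ok or inliers is None:
        raise RuntimeError("pose estimation failed")
    # Levenberg-Marquardt refinement on the inlier set
    rvec, tvec = cv2.solvePnPRefineLM(obj[inliers.ravel()], img[inliers.ravel()],
                                      K, None, rvec, tvec)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec  # pose of the reference camera expressed in SLAM coordinates
```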
 The transformation relation acquisition unit 25 acquires sets of the pose of the camera that captured the reference image data in the SLAM coordinate system and the world coordinates associated with that reference image data, and obtains the transformation relation CSLAM-World between the SLAM coordinate system and the world coordinate system from these sets by a widely known method.
 The transformation processing unit 26 deforms the SLAM map using the transformation relation CSLAM-World between the SLAM coordinate system and the world coordinate system.
 In this deformation processing, the transformation processing unit 26 sequentially performs initialization processing, pose graph optimization, and bundle adjustment. The initialization processing uses the correspondence CSLAM-World between the SLAM coordinate system of the SLAM map and the world coordinate system to apply the following linear transformations, in order, to the entire SLAM map obtained up to that point, roughly aligning it with the world coordinate system. This initialization is performed only at the beginning of processing and at timings at which a predetermined condition is satisfied. Specifically, the initialization is performed at timings at which the search unit 24 has obtained a pair of a target frame and reference image data with at least the threshold number of corresponding feature points: at the first such timing (the beginning of processing), and at the i-th such timing (i being an integer with i >= 2) when the distance between the estimated position and the position information of the reference image data exceeds a predetermined threshold (for example, 10 m).
 The transformation processing unit 26 does not perform the pose graph optimization and bundle adjustment processing until the first initialization has been performed.
 In the initialization processing, as a first linear transformation, the transformation processing unit 26 assumes that the cameras that captured the moving image data lie on a single plane and rotates the SLAM map so that this plane coincides with the xz plane. The plane on which the cameras lie is estimated by principal component analysis of all the camera positions obtained.
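 A minimal sketch of this first linear transformation, assuming NumPy, is shown below: the plane normal is estimated by PCA of the camera positions and the map rotation that aligns it with the y axis is constructed.

```python
import numpy as np

def align_camera_plane_to_xz(camera_positions):
    """camera_positions: (N, 3) camera centres in SLAM coordinates. Returns a 3x3 rotation."""
    centred = camera_positions - camera_positions.mean(axis=0)
    # The right singular vector with the smallest singular value is the plane normal.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    normal = vt[-1]
    y_axis = np.array([0.0, 1.0, 0.0])
    v = np.cross(normal, y_axis)
    c = float(np.dot(normal, y_axis))
    if np.linalg.norm(v) < 1e-12:
        # Normal already (anti-)parallel to the y axis
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    # Rodrigues-style rotation taking the estimated normal onto the y axis
    return np.eye(3) + vx + vx @ vx * (1.0 / (1.0 + c))
```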
 As a second linear transformation, a similarity transformation is performed that brings the points p, in the SLAM coordinate system, of the first through i-th correspondences CSLAM-World close to the corresponding points pworld in the world coordinate system (equation (1)).
The four parameters [a, b, s, θ] of this transformation matrix (two translation components a and b, a uniform scale s, and a rotation angle θ) are estimated by solving the nonlinear least-squares problem of minimizing the cost function

    E = Σ_k || T[a, b, s, θ](p_k) − p_world,k ||²    (1)

using RANSAC (Random Sample Consensus) and the Levenberg-Marquardt method. Here p_world,k denotes the position information associated with the reference image data included in the k-th of the first through i-th pairs of key frames and reference image data.
 The poses of the cameras that captured the target frames and the reference image data, and the positions of the feature points of the SLAM map (three-dimensional positions in the SLAM coordinate system), are transformed by the estimated transformation matrix. The first and second linear transformations here are both kinds of three-dimensional similarity transformation, and these transformations do not improve scale drift.
 After initialization, each time a position estimate using the i-th reference image data is newly obtained, the transformation processing unit 26 performs scale drift improvement processing by pose graph optimization. Specifically, as illustrated in FIG. 3, the pose graph used here includes nodes Sn representing the pose information of the camera that captured the frames, nodes Sm representing the pose information of the cameras that captured the reference image data, and nodes fp representing the position information associated with the reference image data; adjacent nodes Sn (temporally adjacent in the moving image data), which represent the pose information of the camera that captured the frames, are connected to each other by a first constraint e1 on the pose. That is, the relative transformation between the camera poses of adjacent frames is regulated by this constraint e1.
 Further, among the nodes Sn, a node Sn corresponding to a target frame and a node Sm representing the pose information of the camera that captured reference image data containing feature points corresponding to feature points in that target frame are connected to each other by a second constraint e2. That is, the relative transformation between the camera poses of the target frame and the corresponding reference image data is regulated by this constraint e2.
 Furthermore, a node Sm representing the pose information of the camera that captured reference image data and the node fp representing the position information associated with that reference image data are connected to each other by a third constraint e3 relating to distance.
 As shown in FIG. 3(b), the transformation processing unit 26 of the present embodiment corrects the information on the movement path of the camera that captured the frames while allowing changes in scale and while suppressing changes in the camera pose information at each node and changes in the camera's direction of movement. At this time, the transformation processing unit 26 performs the correction so that the positions of the reference image data associated with the target frames, which are some of the frames, come close to the world-coordinate position information originally associated with that reference image data.
 A specific method for this processing will be described next. Here, an example is described in which scale drift is improved by constrained pose graph optimization in the three-dimensional similarity transformation group Sim(3). In pose graph optimization, the camera poses are taken as the optimization variables, and optimization is performed taking into account constraints on the relative transformations between camera poses. That is, in the present embodiment, the pose graph is optimized by performing a nonlinear deformation that takes scale drift into account.
 For this purpose, the representation of the three-dimensional similarity transformation group will first be described. In general, camera poses with six degrees of freedom and relative transformations between camera poses are expressed as elements of the special Euclidean group SE(3). On the other hand, in the optimization performed by the transformation processing unit 26 in the present embodiment, the camera poses and their relative transformations are treated as elements of Sim(3) as described above.
 A three-dimensional rigid transformation G belonging to the special Euclidean group SE(3) is defined by the following equation (2), where the rotation matrix R is an element of SO(3), the translation vector t (in fact a vector quantity printed in bold, written here as t for convenience) is a three-dimensional vector with real components, and s is a non-negative real value:

    G = [ R  t ; 0ᵀ  1 ]    (2)
 Here, the conversion from SE(3) to Sim(3) is performed by leaving the rotation matrix R and the translation vector t unchanged and setting the scale component s to 1. That is, the three-dimensional similarity transformation S (S belonging to Sim(3)) becomes

    S = [ sR  t ; 0ᵀ  1 ],  with s = 1.
 SO(3), SE(3), and Sim(3) all belong to Lie groups; each is related to its corresponding Lie algebra by the exponential map, and the inverse, the logarithmic map, is also defined. Here the Lie algebra is expressed in the vector notation of its coefficients. For example, the Lie algebra corresponding to Sim(3) is written as a seven-dimensional (seven degrees of freedom) vector quantity ξ, obtained by adding a scale component to the six-dimensional (six degrees of freedom) vector quantity representing the camera pose information, and its exponential map is defined as

    S = exp(ξ),  S ∈ Sim(3).

Note that a superscript T on a vector or matrix denotes transposition (the same applies hereinafter).
 The logarithmic map is the inverse of the exponential map, recovering the seven-dimensional coefficient vector as ξ = log(S) by a closed-form expression in which a matrix W appears.
 Here, W is a term similar to Rodrigues' formula. Using this representation of the three-dimensional similarity transformation group, a cost function associated with the constrained deformation of the pose graph is defined, and by minimizing this cost function with the Levenberg-Marquardt method on the Lie group, the processing that improves scale drift while maintaining the structure of the original SLAM map and the processing that brings the corresponding points of the two coordinate systems, the SLAM coordinate system and the world coordinate system, close to each other are performed at once.
 The use of the Levenberg-Marquardt method on Lie groups is already shown in H. Strasdat, J. Montiel and A. J. Davison: "Scale drift-aware large scale monocular SLAM", Robotics: Science and Systems VI (2010), so a detailed description is omitted here.
 Now, using the above three-dimensional similarity transformation, the first constraint e1 on the camera poses between adjacent nodes Si and Sj (temporally adjacent in the moving image data), which represent the pose information of the camera that captured the frames, is defined as the Sim(3) logarithm of the discrepancy between the current relative transformation and the fixed relative transformation ΔSi,j measured before optimization:

    e1i,j = log_Sim(3)( ΔSi,j · Sj · Si⁻¹ ).
 Here, ΔSi,j is the relative transformation between Si and Sj before optimization, converted into Sim(3); this value is held fixed during the optimization processing.
 As already stated, this logarithmic map is a seven-dimensional real vector.
 Similarly, the constraint e2k,l on the relative transformation between the camera pose of a target node and the camera pose of the corresponding reference image data is defined in the same form, as the Sim(3) logarithm of the discrepancy between the current relative transformation and the fixed relative transformation measured before optimization.
 Further, the third constraint e3m, relating to distance, between a node Sm representing the pose information of the camera that captured reference image data and the node fp representing the position information ym associated with that reference image data is defined, using the camera position implied by Sm, as the difference between that camera position and ym:

    e3m = pos(Sm) − ym.
 Of these, minimizing e1i,j and e2k,l works to suppress changes in the relative transformations between camera poses apart from gradual changes in scale. Minimizing e3m works to bring the camera positions of the reference image data close to the world-coordinate position information associated with the reference image data.
 To summarize the description so far, the pose graph used by the position estimation device 1 of the present embodiment is as follows.
- Node Sn: the pose of the camera when the n-th key frame was captured; Sn ∈ Sim(3), n ∈ {1, 2, ..., N}.
- Node Sm: the pose of the camera when the m-th reference image data was captured; Sm ∈ Sim(3), m ∈ {1, 2, ..., M}.
- Edge e1i,j: the constraint given by the relative transformation between the camera poses at the i-th and j-th key frames; (i, j) ∈ C1.
- Edge e2k,l: the constraint given by the relative transformation between the camera pose of the k-th target frame and the camera pose of the l-th reference image data; (k, l) ∈ C2.
- Edge e3m: the distance between the pose Sm of the camera that captured the reference image data and the position information ym, in the world coordinate system, associated with that reference image data; m ∈ {1, 2, ..., M}.
 Here, N is the total number of key frames, M is the number of target frames (the total number of reference image data associated with target frames), C1 is the set of pairs of key frames in which the same feature points in the SLAM map are captured, and C2 is the set of pairs of a target frame and the corresponding reference image data.
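 A minimal structural sketch of this pose graph is given below. Poses are stored as 4x4 similarity matrices and scipy.linalg.logm stands in, as an assumption, for the Sim(3) logarithmic map; the residuals are written schematically to show the graph layout, not to reproduce the patented optimizer.

```python
import numpy as np
from scipy.linalg import logm, inv

class PoseGraph:
    def __init__(self):
        self.keyframe_poses = {}       # n -> Sn (4x4 similarity matrix)
        self.reference_poses = {}      # m -> Sm (4x4 similarity matrix)
        self.reference_positions = {}  # m -> ym (3-vector, world coordinates)
        self.e1_edges = []             # (i, j, delta_S_ij) for (i, j) in C1
        self.e2_edges = []             # (k, l, delta_S_kl) for (k, l) in C2

    def residuals(self):
        r = []
        for i, j, delta in self.e1_edges:     # relative-pose constraints between key frames
            Si, Sj = self.keyframe_poses[i], self.keyframe_poses[j]
            r.append(np.linalg.norm(logm(delta @ Sj @ inv(Si))))
        for k, l, delta in self.e2_edges:     # key frame <-> reference image constraints
            Sk, Sl = self.keyframe_poses[k], self.reference_poses[l]
            r.append(np.linalg.norm(logm(delta @ Sl @ inv(Sk))))
        for m, Sm in self.reference_poses.items():  # geotag position constraints
            cam_centre = inv(Sm)[:3, 3]             # camera centre implied by Sm
            r.append(np.linalg.norm(cam_centre - self.reference_positions[m]))
        return np.array(r)
```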
 After the initialization processing described above, the transformation processing unit 26 executes processing that optimizes the pose graph defined as above. That is, the transformation processing unit 26 extracts, from the frames of the accepted moving image data, at least one set of key frames C1 (assumed to contain N key frames) in which common feature points are captured. Referring also to the pairs C2 (assumed to be M pairs), found by the search unit 24, of target frames and reference image data containing the corresponding feature points, it estimates the camera pose information S1, S2, ... by minimizing, with the Levenberg-Marquardt method, a cost function on the Lie manifold of the form

    E_pose = Σ_(i,j)∈C1 ||e1i,j||² + Σ_(k,l)∈C2 ||e2k,l||² + Σ_m ||e3m||².
 Furthermore, the transformation processing unit 26 reflects the transformation obtained by this optimization in the positions of the feature points in the SLAM map. This reflection can be performed by adopting a widely known method, for example the method used in H. Strasdat, et al., "Scale drift-aware large scale monocular SLAM," Robotics: Science and Systems VI, 2010.
 As described above, in the present embodiment, the pose graph used by Strasdat et al. is extended with nodes Sm representing the poses of the cameras that captured the m-th reference image data, edges e2k,l representing the constraints given by the relative transformations between the camera poses of the target frames and the reference image data, and edges e3m representing the distances between the pose Sm of the camera that captured the reference image data and the position information ym, in the world coordinate system, associated with that reference image data; by correcting the camera positions of the reference image data, such as geotagged images, toward the world-coordinate values represented by the position information (geotags), rather than relying on loop closure, the scale drift of the SLAM map is improved.
 If the frames of the accepted moving image data contain another group of key frames C1, different from the key frame group C1 already subjected to the above pose graph optimization, in which common feature points are captured, the transformation processing unit 26 sequentially performs the above pose graph optimization processing on that key frame group C1 as well.
 The transformation processing unit 26 further deforms the SLAM map by bundle adjustment (BA) including constraints determined in relation to the reference image data.
 That is, after estimating the camera poses by pose graph optimization and correcting the SLAM map, the transformation processing unit 26 performs the following bundle adjustment. The transformation processing unit 26 of the present embodiment jointly minimizes the reprojection error of Cfp-kf and the reprojection error of Cfp-geo.
 Specifically, the transformation processing unit 26 that performs this bundle adjustment computes the reprojection error ri,j between the i-th feature point and the j-th camera pose as

    ri,j = xi − π( Rj Xi + tj ).
 Here,

    π(p) = [ fx·px/pz + cx ,  fy·py/pz + cy ]ᵀ,

where Xi is the coordinate of the feature point in the SLAM coordinate system, xi is the two-dimensional coordinate of the feature point within the frame, and Rj and tj represent the rotation and translation of the j-th camera pose. The function π projects three-dimensional coordinates (p = [px, py, pz]ᵀ) onto the two-dimensional coordinate system; (fx, fy) is the focal length and (cx, cy) represents the coordinates of the center of projection.
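 A minimal sketch of the projection function π and the reprojection error ri,j described above, assuming the intrinsics (fx, fy, cx, cy) are known, is shown below.

```python
import numpy as np

def project(p, fx, fy, cx, cy):
    """Project a 3-D point p = [px, py, pz] onto the image plane (pinhole model)."""
    px, py, pz = p
    return np.array([fx * px / pz + cx, fy * py / pz + cy])

def reprojection_error(x_i, X_i, R_j, t_j, fx, fy, cx, cy):
    """x_i: observed 2-D position, X_i: 3-D point in SLAM coordinates,
    (R_j, t_j): pose of the j-th camera."""
    return x_i - project(R_j @ X_i + t_j, fx, fy, cx, cy)
```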
 Further, in order to reflect the position information of the reference image data in the bundle adjustment, the total cost function is defined, including the constraints on the reference image data, as the sum of the robustified squared reprojection errors over both the key-frame correspondences and the reference-image correspondences:

    C_total = Σ_j∈C1 Σ_i∈C5 ρ(||ri,j||²) + Σ_(i,j)∈C3 ρ(||ri,j||²).
 Here, Tj is the information on the pose of the camera that captured the j-th key frame, expressed as an element of SE(3), ρ is the Huber robust cost function, and C5 represents the feature points in the key frames of C1. C1 and C3 are the sets already described.
 The feature point positions and the camera pose information for the key frames are obtained by minimizing this total cost function on the Lie manifold. This minimization can be performed using the Levenberg-Marquardt method.
 As described above, in the present embodiment, in order to incorporate the constraints given by the position information of the reference image data, not only the usual reprojection error of Cfp-kf but also the reprojection error of Cfp-geo is minimized. During the bundle adjustment, the pose information of the cameras that captured the reference image data is held fixed. This deformation can further reduce the scale drift of the SLAM map, particularly when a sufficiently good initial solution is given.
[Operation]
 The position estimation device 1 of the present embodiment has the configuration described above and operates as follows. In one example of the present embodiment, the position estimation device 1 accepts moving image data including a series of frames obtained by imaging subjects along a movement path while moving, and executes the processing illustrated in FIG. 4.
 The position estimation device 1 of the present embodiment extracts feature points of the subjects captured in each frame of the moving image data and generates a SLAM map, a reconstruction map that associates the feature points with coordinates in a reconstruction space, which is a predetermined virtual three-dimensional space (S1: SLAM processing).
 Next, the position estimation device 1 selects at least one of the frames included in the accepted moving image data as a target frame (S2) and repeats the following processing for each target frame.
 That is, the position estimation device 1 sets the selected target frames, one at a time in a predetermined order, as the target frame of interest. The position estimation device 1 then acquires reference image data in which at least one subject captured in the target frame of interest appears and which is associated with position information, in a predetermined world coordinate system, representing the position at which it was captured in advance (S3).
 As already described, this acquisition processing may be performed by instructing a server on the network to perform a search, or by instructing the user to input reference image data.
 The position estimation device 1 searches the acquired reference image data for reference feature points corresponding to the feature points found in the target frame of interest (feature point matching: S4). The position estimation device 1 then checks whether the number of reference feature points found here exceeds a predetermined threshold (for example, 5) (S5). If, in processing S5, the number of found reference feature points is at least the predetermined threshold (S5: Yes), the position estimation device 1 estimates the position and pose of the camera that captured the reference image data acquired in processing S3 (S6). This processing can be performed using the coordinates of the feature points in the target frame of interest (two-dimensional coordinates), the coordinates of the feature points in the coordinate system of the SLAM map (three-dimensional coordinates), the coordinates of the corresponding reference feature points in the reference image data (two-dimensional coordinates), and the position information (three-dimensional coordinate values in the world coordinate system) associated with that reference image data.
 The position estimation device 1 then returns to processing S3, sets the next target frame as the target frame of interest, and continues processing. If there is no target frame that has not yet been set as the target frame of interest (that is, if the processing for all the target frames has been completed), the device proceeds to the next processing (processing S10 described below).
 If the number of reference feature points found in processing S5 is less than the predetermined threshold (S5: No), the device returns to processing S3 without executing processing S6, sets the next target frame as the target frame of interest, and continues processing. Here as well, if there is no target frame that has not yet been set as the target frame of interest (if the processing for all the target frames has been completed), the device proceeds to the next processing (processing S10 described below).
 Through the processing from S3 to S6, the position estimation device 1 can acquire the transformation relation between the world coordinate system, which is the coordinate system of the position information associated with each reference image data, and the coordinate system of the SLAM map, which is the coordinate system of the coordinate values of the feature points found in each target frame.
 Next, the position estimation device 1 performs the deformation processing of the SLAM map. In this processing, the position estimation device 1 first determines whether to perform the initialization processing (S10). Here, for example, it may be determined that the initialization processing is to be performed when the initialization processing has not been performed in the past. When performing the sequential processing described later, it may instead, or in addition, be determined that the initialization processing is to be performed when the distance between the position previously estimated as the camera position (this estimation will be described later) and the position information associated with the corresponding reference image data exceeds a predetermined threshold (for example, 10 m).
 When the position estimation device 1 determines in processing S10 that the initialization processing is to be performed, it executes the initialization processing (S11), first assuming that the cameras that captured the moving image data lie on a single plane and performing the following processing. That is, based on the positions, in the SLAM map obtained so far, of the camera that captured the accepted moving image data, the position estimation device 1 determines, by principal component analysis of those position data, the coordinate axis of the plane on which the camera is estimated to lie (for example, the axis of the normal to that plane), and rotates the SLAM map so that this normal becomes parallel to the y axis (so that the plane on which the camera lies becomes the xz plane).
 Also, in this initialization processing, using the frames and reference image data obtained so far and the position information associated with the reference image data (position information in the world coordinate system), the position estimation device 1 scales (applies a similarity transformation to) the SLAM coordinate system so that the sum of the absolute values (or the sum of the squares) of the differences between the coordinate values of the feature points in the SLAM coordinate system and the world-coordinate values of the corresponding reference feature points is minimized.
 When the position estimation device 1 determines that the initialization processing is not to be performed, or after the initialization processing, it further optimizes a constrained pose graph in Sim(3), as illustrated in FIG. 3, including:
- node Sn: the pose of the camera when the n-th key frame was captured; Sn ∈ Sim(3), n ∈ {1, 2, ..., N},
- node Sm: the pose of the camera when the m-th reference image data was captured; Sm ∈ Sim(3), m ∈ {1, 2, ..., M},
- edge e1i,j: the constraint given by the relative transformation between the camera poses at the i-th and j-th key frames; (i, j) ∈ C1,
- edge e2k,l: the constraint given by the relative transformation between the camera pose of the k-th target frame and the camera pose of the l-th reference image data; (k, l) ∈ C2,
- edge e3m: the distance between the pose Sm of the camera that captured the reference image data and the position information ym, in the world coordinate system, associated with that reference image data; m ∈ {1, 2, ..., M}
(S12).
 Through this processing S12, optimization processing is performed using a cost function that includes a cost value relating to changes in the estimated position and orientation, expressed in SLAM coordinates, of the camera that captured the moving image data within the reconstruction space (SLAM map), a cost value relating to the scaling of the camera's movement amount, and a cost value relating to the distance between the camera position and the position information associated with the reference image data; a correction that scales the movement amount of the camera while suppressing changes in the estimated position and orientation of the camera expressed in SLAM coordinates and a correction that brings the position and orientation of the camera close to the position information associated with the corresponding reference image data are thereby performed at once.
 The position estimation device 1 also performs bundle adjustment processing that corrects the map so that the reprojection errors of the feature points into the key frames and the reprojection errors of the reference feature points into the reference image data become small (S13). As a result, estimates of the position and orientation, expressed in the world coordinate system, of the camera that captured the moving image data are obtained for each target frame.
 The position estimation device 1 outputs the information on the position and orientation of the camera in the world coordinate system obtained through the processing so far (the estimation results) (S14), and ends the processing.
[Sequential input]
 In the description so far, the case where the position estimation device 1 accepts moving image data whose capture has already been completed has been described as an example, but the present embodiment is not limited to this.
 In one example of the present embodiment, moving image data including a series of frames obtained by imaging subjects on a movement path while moving may be received sequentially, each time a part of the frames (for example, one frame) has been captured. When processing moving image data while it is being captured in this way, the SLAM map is generated by sequential processing (that is, feature points of the subjects captured in the sequentially received frames are extracted, and the feature points are associated with coordinates in the reconstruction space, which is a predetermined virtual three-dimensional space). Such SLAM map generation is widely known as incremental SfM (Structure from Motion) and the like, and a detailed description is therefore omitted here.
 In this example, the position estimation device 1 performs the subsequent processing sequentially. That is, the position estimation device 1 determines whether or not a received frame is to be used as a keyframe and, if so, selects that keyframe as a target frame and performs the processing from step S3 onward in FIG. 4. In this case, if the difference between the camera position estimated for a certain frame by steps S12 and S13 in FIG. 4 and the position represented by the position information associated with the reference image data corresponding to that frame exceeds a predetermined threshold, the processing may return to step S10 and the initialization may be redone.
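 A rough sketch of this sequential loop is given below. The SLAM front end, the keyframe test, the per-keyframe correction (S3 to S13), and the reinitialization threshold are all placeholders; the interfaces and the 10 m threshold value are assumptions made only for illustration.

```python
import numpy as np

REINIT_THRESHOLD_M = 10.0   # assumed value; the text only says "a predetermined threshold"

def process_incoming_frames(frames, slam, geo_localizer):
    """Sequential variant: frames arrive one at a time; `slam` and `geo_localizer`
    stand in for the incremental SfM/SLAM front end and for the S3-S13 correction
    steps described above (illustrative interfaces)."""
    for frame in frames:
        slam.track(frame)                               # incremental SLAM map update
        if not slam.is_keyframe(frame):
            continue
        result = geo_localizer.correct(frame)           # runs S3 .. S13 for this keyframe
        if result.matched_reference is not None:
            gap = np.linalg.norm(result.world_position -
                                 result.matched_reference.world_position)
            if gap > REINIT_THRESHOLD_M:
                geo_localizer.reinitialize()            # redo the initialization (S10)
```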
 Next, position estimation methods that also use reference image data associated with position information (geotagged images in the following examples) but differ from the example of the present invention are compared with the position estimation method according to the example of the present invention, and the accuracy of the method of the example, in which the position estimation device 1 of the embodiment described above is implemented on a computer, is evaluated quantitatively.
 In addition, for the three-stage deformation used in the following examples, consisting of (1) the initialization process (linear transformation: ILT), (2) the constrained pose graph optimization on Sim(3) (PGO), and (3) the bundle adjustment process (BA), which corrects the estimates so that the reprojection error of the feature points onto the frames and the reprojection error of the reference feature points onto the reference image data become small, the effect of each stage on accuracy and its usefulness are examined.
[Data set and implementation details]
 The following examples use The Málaga Stereo and Laser Urban Data Set (Málaga data set), a driving-video data set captured over a long distance in an urban area of Spain. The video in the Málaga data set has a resolution of 1024 × 768 and a frame rate of 20 fps. In these examples, two videos (Video 1 and Video 2) are cut out of the data set and used for evaluation. Neither of the routes captured in the two videos forms a loop, and each route is 1 km or longer. GPS position information acquired once per second is associated with all frames, but some frames contain errors of 10 m or more.
[Evaluation metrics]
 For quantitative comparison, the average (Ave) and standard deviation (SD) of the distances (in meters) between the positions of a plurality of points on the ground-truth trajectory (Ground Truth: GT) described below and the estimated camera positions for the keyframes corresponding to those GT points are used as evaluation metrics. As noted above, the GPS position information attached to the data set contains errors, so the positions of the GT points were set manually for several keyframes from the three-dimensional map of Google Street View.
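 For illustration, with the GT and estimated camera positions held as NumPy arrays, these two metrics reduce to the following (names are illustrative):

```python
import numpy as np

def localization_error_stats(gt_positions, est_positions):
    """Average and standard deviation, in meters, of per-keyframe position error."""
    dists = np.linalg.norm(gt_positions - est_positions, axis=1)
    return dists.mean(), dists.std()

# ave_m, sd_m = localization_error_stats(gt_xyz, estimated_xyz)
```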
[Comparison with other position estimation methods]
 As other position estimation methods for comparison, the method described in T. Kroeger et al., "Video registration to SfM models," ECCV, pp. 1-16, 2014 (hereinafter the Kroeger method) and the method described in S. P. Chang et al., "Extracting driving behavior: Global metric localization from dashcam videos in the wild," ECCV, pp. 136-148, 2016 (hereinafter the Chang method) were compared with the present example.
 Unlike the present example, the Kroeger and Chang methods use RGB-D images (images with distance information) and 3D point cloud information. Therefore, the part that uses SLAM and obtains the correspondence between the world coordinate system and the SLAM coordinate system (the reconstructed three-dimensional coordinate system) was replaced with the same processing as in the present example, and only the processing that deforms the SLAM map was compared. The Kroeger method smooths the camera poses using spline smoothing; here, as a concrete instance of that method, the camera poses of the geotagged images (reference image data) matched to keyframes were smoothed with a cubic B-spline (Interpolation).
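 For illustration, this kind of trajectory interpolation can be sketched with SciPy's CubicSpline used as a stand-in for the cubic B-spline fit (the actual Kroeger implementation differs; names are illustrative):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def interpolate_trajectory(anchor_times, anchor_positions, query_times):
    """Fit a cubic spline through the sparse matched camera positions and
    evaluate it at every keyframe timestamp."""
    spline = CubicSpline(anchor_times, anchor_positions, axis=0)
    return spline(query_times)   # (len(query_times), 3) interpolated positions

# positions = interpolate_trajectory(matched_times, matched_xyz, keyframe_times)
```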
 The Chang method applies a two-dimensional affine transformation in the xz plane and a scale transformation in the y-axis direction so that the SLAM map reconstructed from the input moving image data matches the 3D point cloud obtained from Google Street View. Here, specifically, a method that applies the same transformation (Affine+) as the Chang method to the SLAM map was used.
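 For illustration, applying such a transformation to SLAM map points can be sketched as follows (estimating the 2D affine parameters and the y-scale from correspondences is omitted; names are illustrative):

```python
import numpy as np

def apply_affine_plus(points, A, b, sy):
    """points: (N, 3) SLAM map points ordered (x, y, z).
    A: 2x2 affine matrix and b: 2-vector acting on the (x, z) columns;
    sy: scale factor applied to the y (height) column."""
    out = points.copy()
    out[:, [0, 2]] = points[:, [0, 2]] @ A.T + b
    out[:, 1] = sy * points[:, 1]
    return out
```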
 The results of these processes are shown in FIG. 5. As shown in FIG. 5, the trajectory obtained with the method of the present example stays much closer to the GT than the other methods, achieving a substantial improvement in accuracy. Specifically, the Kroeger method interpolates the sparse correspondence between geotags and video without considering the three-dimensional structure, so large position estimation errors occur where sufficient corresponding points do not exist. The Chang method does use the three-dimensional structure of the video, but because it relies on a simple linear transformation, it is strongly affected by distortion caused by scale drift and may fail to estimate positions with sufficient accuracy.
[Effect of the three-stage deformation]
 The effect of each stage of the three-stage deformation performed in the present example on accuracy was also examined. Specifically, position estimation based on Video 1 was performed using various combinations of at least some of the three deformation processes. The results are shown in FIG. 6, in which "*" indicates that no meaningful value was obtained. As can be seen from FIG. 6, the most accurate position estimation is achieved when all deformation methods are applied (#5). When the results of #1 to #5 in FIG. 6 are plotted on an actual map, the cases without pose graph optimization (PGO) (#2, #3) show places with errors of about 100 meters even when bundle adjustment (BA) is applied. On the other hand, when pose graph optimization (PGO) is applied (#4, #5), the scale drift is eliminated, and as a result it can be confirmed that the final bundle adjustment (BA) works properly.
 From this, it was found that pose graph optimization must be applied at least for bundle adjustment to work properly.
[Conclusion]
 In this example, using the result of three-dimensional reconstruction (SLAM), the absolute position (the position in the world coordinate system, that is, information related to latitude and longitude) of the camera at the time each frame of the moving image data was captured was estimated from the sparse correspondence between geotagged images and the captured moving image data. When, as with the moving image data used in this example, the video is captured along a route longer than 1 km and the trajectory of the route does not form a loop (never returns to the same point), the result of three-dimensional reconstruction can contain errors of several tens of meters at real scale.
 This is because, in ordinary three-dimensional reconstruction from moving image data such as SLAM, the accumulated error can be reduced only when the movement trajectory at the time of capture forms a loop (the same place is observed again).
 In the present embodiment, processing that corrects scale drift is integrated by using the sparse correspondence between geotagged images and the captured moving image data (association of only some frames), which makes it possible to use the structure obtained by three-dimensional reconstruction appropriately for absolute position estimation.
 The smaller the error contained in the position information, such as latitude and longitude, associated with the reference image data, the higher the accuracy is expected to be. Also, if the feature points in the frames of the captured moving image data and the corresponding reference feature points in the reference image data lie far from the positions at which the respective images were captured, or if they are few in number, the accuracy of matching between feature points becomes insufficient. Furthermore, when the camera faces a direction parallel to the road (the movement trajectory), or when the camera's angle of view is narrow, the error in the direction from the camera to the feature point group becomes relatively large when estimating the position and orientation of the camera that captured the reference image data; this error arises mainly in the direction parallel to the road.
 It is therefore preferable to further improve the accuracy of the feature point matching method and/or to use a device such as an omnidirectional camera so that images in directions orthogonal to the movement trajectory are also included in the processing.
Reference Signs List: 1 position estimation device, 11 control unit, 12 storage unit, 13 operation unit, 14 display unit, 15 communication unit, 16 interface unit, 21 reception unit, 22 SLAM processing unit, 23 reference image data acquisition unit, 24 search unit, 25 conversion relation acquisition unit, 26 conversion processing unit.

Claims (6)

  1.  A position estimation device comprising:
     receiving means for receiving moving image data including a series of frames obtained by imaging subjects on a movement path while moving;
     acquisition means for taking at least one of the frames included in the received moving image data as a target frame and acquiring reference image data in which at least one subject imaged in the target frame is imaged, the reference image data being associated with position information, in a predetermined world coordinate system, representing the position at which it was captured in advance;
     reconstruction means for extracting feature points of the subjects imaged in the frames included in the moving image data and generating a reconstruction map in which the feature points are associated with coordinates in a reconstruction space, which is a predetermined virtual three-dimensional space;
     search means for searching the reference image data for reference feature points corresponding to the feature points;
     relation acquisition means for acquiring, based on the retrieved reference feature points, the position information associated with the reference image data, and the coordinates associated with the feature points, a transformation relation between the world coordinate system, which is the coordinate system of the position information, and the coordinate system of the coordinates associated with the feature points; and
     conversion means for correcting the reconstruction map using the transformation relation,
     wherein the conversion means corrects the reconstruction map by performing a correction that scales the amount of movement of the camera that captured the moving image data while suppressing changes in the estimated position and orientation of the camera represented by coordinates in the reconstruction space, and a correction that brings the position and orientation of the camera close to the position information associated with the reference image data.
  2.  The position estimation device according to claim 1, wherein the conversion means performs, at one time, the correction that scales the amount of movement of the camera that captured the moving image data while suppressing changes in the estimated position and orientation of the camera represented by coordinates in the reconstruction space and the correction that brings the position and orientation of the camera close to the position information associated with the reference image data, using a cost function including a cost value relating to changes in the estimated position and orientation of the camera represented by coordinates in the reconstruction space, a cost value relating to the scaling of the amount of movement of the camera, and a cost value relating to the distance between the position of the camera and the position information associated with the reference image data.
  3.  The position estimation device according to claim 1 or 2, wherein, after performing the correction that scales the amount of movement of the camera that captured the moving image data while suppressing changes in the estimated position and orientation of the camera represented by coordinates in the reconstruction space and the correction that brings the position and orientation of the camera close to the position information associated with the reference image data, the conversion means further performs bundle adjustment processing that corrects the estimates so that the reprojection error of the feature points onto the frames and the reprojection error of the reference feature points onto the reference image data become small.
  4.  The position estimation device according to any one of claims 1 to 3, wherein the receiving means sequentially receives the moving image data including a series of frames obtained by imaging subjects on a movement path while moving, each time a part of the frames has been captured, and the reconstruction means extracts feature points of the subjects imaged in the sequentially received frames and associates the feature points with coordinates in the reconstruction space, which is a predetermined virtual three-dimensional space.
  5.  A position estimation method using a computer, the computer:
     receiving moving image data including a series of frames obtained by imaging subjects on a movement path while moving;
     taking at least one of the frames included in the received moving image data as a target frame and acquiring reference image data in which at least one subject imaged in the target frame is imaged, the reference image data being associated with position information, in a predetermined world coordinate system, representing the position at which it was captured in advance;
     extracting feature points of the subjects imaged in the frames included in the moving image data and generating a reconstruction map in which the feature points are associated with coordinates in a reconstruction space, which is a predetermined virtual three-dimensional space;
     searching the reference image data for reference feature points corresponding to the feature points;
     acquiring, based on the retrieved reference feature points, the position information associated with the reference image data, and the coordinates associated with the feature points, a transformation relation between the world coordinate system, which is the coordinate system of the position information, and the coordinate system of the coordinates associated with the feature points; and
     correcting the reconstruction map by performing a correction that scales the amount of movement of the camera that captured the moving image data while suppressing changes in the estimated position and orientation of the camera represented by coordinates in the reconstruction space, and a correction that brings the position and orientation of the camera close to the position information associated with the reference image data.
  6.  A program causing a computer to function as:
     receiving means for receiving moving image data including a series of frames obtained by imaging subjects on a movement path while moving;
     acquisition means for taking at least one of the frames included in the received moving image data as a target frame and acquiring reference image data in which at least one subject imaged in the target frame is imaged, the reference image data being associated with position information, in a predetermined world coordinate system, representing the position at which it was captured in advance;
     reconstruction means for extracting feature points of the subjects imaged in the frames included in the moving image data and generating a reconstruction map in which the feature points are associated with coordinates in a reconstruction space, which is a predetermined virtual three-dimensional space;
     search means for searching the reference image data for reference feature points corresponding to the feature points;
     relation acquisition means for acquiring, based on the retrieved reference feature points, the position information associated with the reference image data, and the coordinates associated with the feature points, a transformation relation between the world coordinate system, which is the coordinate system of the position information, and the coordinate system of the coordinates associated with the feature points; and
     conversion means for correcting the reconstruction map by performing a correction that scales the amount of movement of the camera that captured the moving image data while suppressing changes in the estimated position and orientation of the camera represented by coordinates in the reconstruction space, and a correction that brings the position and orientation of the camera close to the position information associated with the reference image data.

PCT/JP2018/023697 2017-06-21 2018-06-21 Position estimating device, position estimating method, and program WO2018235923A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201762522857P 2017-06-21 2017-06-21
US62/522,857 2017-06-21

Publications (1)

Publication Number Publication Date
WO2018235923A1 true WO2018235923A1 (en) 2018-12-27

Family

ID=64737652

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/023697 WO2018235923A1 (en) 2017-06-21 2018-06-21 Position estimating device, position estimating method, and program

Country Status (1)

Country Link
WO (1) WO2018235923A1 (en)


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013535013A (en) * 2010-06-25 2013-09-09 トリンブル ナビゲーション リミテッド Method and apparatus for image-based positioning

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013535013A (en) * 2010-06-25 2013-09-09 トリンブル ナビゲーション リミテッド Method and apparatus for image-based positioning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHANG, SHAOPIN ET AL.: "Extracting Driving Behavior: Global Metric Localization from Dashcam Video in the Wild", COMPUTER VISION - ECCV 2016 WORKSHOPS, 2016, pages 136 - 148, XP055565003 *
IWAMI, KAZUYA ET AL.: "Global Metric Localization and Correction of Scale Drift using Street View", IEICE TECHNICAL REPORT, vol. 117, no. 106, 15 June 2017 (2017-06-15), pages 69 - 74 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110106755A (en) * 2019-04-04 2019-08-09 武汉大学 Utilize the uneven pliable detection method of the high-speed railway rail of attitude reconstruction rail geometric shape
CN110106755B (en) * 2019-04-04 2020-11-03 武汉大学 Method for detecting irregularity of high-speed rail by reconstructing rail geometric form through attitude
CN110570473A (en) * 2019-09-12 2019-12-13 河北工业大学 weight self-adaptive posture estimation method based on point-line fusion
JP2021082181A (en) * 2019-11-22 2021-05-27 パナソニックIpマネジメント株式会社 Position estimation device, vehicle, position estimation method and position estimation program
WO2021100650A1 (en) * 2019-11-22 2021-05-27 パナソニックIpマネジメント株式会社 Position estimation device, vehicle, position estimation method and position estimation program
CN111882494A (en) * 2020-06-28 2020-11-03 广州文远知行科技有限公司 Pose graph processing method and device, computer equipment and storage medium
CN111882494B (en) * 2020-06-28 2024-05-14 广州文远知行科技有限公司 Pose graph processing method and device, computer equipment and storage medium
WO2022118061A1 (en) * 2020-12-04 2022-06-09 Hinge Health, Inc. Object three-dimensional localizations in images or videos
CN112819970A (en) * 2021-02-19 2021-05-18 联想(北京)有限公司 Control method and device and electronic equipment
CN112819970B (en) * 2021-02-19 2023-12-26 联想(北京)有限公司 Control method and device and electronic equipment
JP2023529786A (en) * 2021-05-07 2023-07-12 テンセント・アメリカ・エルエルシー A method for estimating inter-camera pose graphs and transformation matrices by recognizing markers on the ground in panoramic images
CN115700507A (en) * 2021-07-30 2023-02-07 北京小米移动软件有限公司 Map updating method and device
CN115700507B (en) * 2021-07-30 2024-02-13 北京小米移动软件有限公司 Map updating method and device
CN113781550A (en) * 2021-08-10 2021-12-10 国网河北省电力有限公司保定供电分公司 Four-foot robot positioning method and system
CN113779012B (en) * 2021-09-16 2023-03-07 中国电子科技集团公司第五十四研究所 Monocular vision SLAM scale recovery method for unmanned aerial vehicle
CN113779012A (en) * 2021-09-16 2021-12-10 中国电子科技集团公司第五十四研究所 Monocular vision SLAM scale recovery method for unmanned aerial vehicle

Similar Documents

Publication Publication Date Title
WO2018235923A1 (en) Position estimating device, position estimating method, and program
US10269147B2 (en) Real-time camera position estimation with drift mitigation in incremental structure from motion
Li et al. DeepI2P: Image-to-point cloud registration via deep classification
US11176701B2 (en) Position estimation system and position estimation method
US10269148B2 (en) Real-time image undistortion for incremental 3D reconstruction
US6587601B1 (en) Method and apparatus for performing geo-spatial registration using a Euclidean representation
US9396583B2 (en) Method of modelling buildings on the basis of a georeferenced image
US20170236284A1 (en) Registration of aerial imagery to vector road maps with on-road vehicular detection and tracking
CN111862126A (en) Non-cooperative target relative pose estimation method combining deep learning and geometric algorithm
WO2023045455A1 (en) Non-cooperative target three-dimensional reconstruction method based on branch reconstruction registration
US20170092015A1 (en) Generating Scene Reconstructions from Images
CN112750203B (en) Model reconstruction method, device, equipment and storage medium
Tao et al. Automated localisation of Mars rovers using co-registered HiRISE-CTX-HRSC orthorectified images and wide baseline Navcam orthorectified mosaics
CN114565863B (en) Real-time generation method, device, medium and equipment for orthophoto of unmanned aerial vehicle image
CN112197764A (en) Real-time pose determining method and device and electronic equipment
CN112132754B (en) Vehicle movement track correction method and related device
Patil et al. A new stereo benchmarking dataset for satellite images
Zhao et al. RTSfM: Real-time structure from motion for mosaicing and DSM mapping of sequential aerial images with low overlap
CN111553845A (en) Rapid image splicing method based on optimized three-dimensional reconstruction
CN110570474A (en) Pose estimation method and system of depth camera
US20240161392A1 (en) Point cloud model processing method and apparatus, and readable storage medium
Suliman et al. Development of line-of-sight digital surface model for co-registering off-nadir VHR satellite imagery with elevation data
Zhao et al. Fast georeferenced aerial image stitching with absolute rotation averaging and planar-restricted pose graph
WO2023017663A1 (en) Systems and methods for image processing based on optimal transport and epipolar geometry
Fu-Sheng et al. Batch reconstruction from UAV images with prior information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18819744

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18819744

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP