WO2018235923A1 - Position estimating device, position estimating method, and program - Google Patents

Position estimating device, position estimating method, and program Download PDF

Info

Publication number
WO2018235923A1
Authority
WO
WIPO (PCT)
Prior art keywords
image data
camera
reference image
coordinate system
coordinates
Prior art date
Application number
PCT/JP2018/023697
Other languages
French (fr)
Japanese (ja)
Inventor
Kiyoharu Aizawa (相澤 清晴)
Kazuya Iwami (石見 和也)
Original Assignee
The University of Tokyo (国立大学法人 東京大学)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The University of Tokyo
Publication of WO2018235923A1 publication Critical patent/WO2018235923A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras

Definitions

  • the present invention relates to a position estimation device, a position estimation method, and a program.
  • the method of estimating the absolute position of a video using geotagged images can achieve relatively high-accuracy position estimation, compared with methods using road networks or satellite imagery, provided that sufficient correspondences between images can be obtained.
  • on the other hand, associating the input traveling video with geotagged images is easily affected by changes in the illumination environment and in the captured angle of view, and can be difficult. If this association is not appropriate, position estimation either fails or cannot reach the required accuracy.
  • the present invention has been made in view of such circumstances, and one of its objects is to provide a position estimation device, a position estimation method, and a program that can set corresponding position information for each frame image of a traveling video, including frames captured at places that cannot be directly associated with a geotagged image.
  • Non-Patent Document 1 discloses estimating (reconstructing) three-dimensional information of objects captured in a traveling video by SLAM (Simultaneous Localization and Mapping) and performing position estimation of the traveling video by deforming the resulting three-dimensional reconstruction map as an object in the world coordinate system; however, it is known that with the method of Non-Patent Document 1 the error accumulates as the traveling distance becomes longer, and the position estimation accuracy decreases.
  • the present invention, which solves the problems of the prior art described above, is a position estimation device comprising: accepting means for accepting moving image data including a series of frames obtained by imaging subjects on a movement path while moving; acquisition means for taking at least one of the frames included in the moving image data as a target frame and acquiring reference image data in which at least one subject imaged in the target frame is also imaged and which is associated with position information, in a predetermined world coordinate system, representing the position at which it was captured in advance; reconstruction means for extracting feature points of the subjects imaged in each frame included in the moving image data and generating a reconstruction map in which the feature points are associated with coordinates in a reconstruction space, which is a virtual three-dimensional space; search means for searching the reference image data for reference feature points corresponding to the feature points; relationship acquisition means for acquiring, based on the found reference feature points, the position information associated with the reference image data, and the coordinates associated with the feature points, a conversion relationship between the world coordinate system, which is the coordinate system of the position information, and the coordinate system of the coordinates associated with the feature points; and conversion means for correcting the reconstruction map using the conversion relationship, wherein the conversion means corrects the reconstruction map by performing a correction that scales the amount of movement of the camera while suppressing changes in the estimated position and orientation, expressed in the coordinates of the reconstruction space, of the camera that captured the moving image data, and a correction that brings the position and orientation of the camera close to the position information associated with the reference image data.
  • according to the present invention, corresponding position information can be set even for frame images of a traveling video that include places that cannot be directly associated with a geotagged image.
  • the position estimation device 1 includes, as illustrated in FIG. 1, a control unit 11, a storage unit 12, an operation unit 13, a display unit 14, a communication unit 15, and an interface unit 16.
  • the control unit 11 is a program control device such as a CPU, and operates in accordance with a program stored in the storage unit 12.
  • the control unit 11 receives moving image data including a series of frames obtained by capturing an object on a moving path while moving.
  • taking at least one of the frames included in the received moving image data as a target frame, the control unit 11 acquires reference image data in which at least one subject imaged in the target frame is also imaged and which is associated with position information, in world coordinates, representing the position at which it was captured in advance.
  • the control unit 11 extracts feature points of the subjects captured in each frame included in the received moving image data and performs processing as SLAM (Simultaneous Localization and Mapping) that associates the feature points with coordinates in the reconstruction space, a predetermined virtual three-dimensional space, while also searching the reference image data for reference feature points corresponding to those feature points.
  • based on the found reference feature points, the position information associated with the reference image data, and the coordinates associated with the feature points (the coordinate system of these feature points, that is, the coordinate system of the reconstruction space, is hereinafter called the SLAM coordinate system), the control unit 11 acquires a conversion relationship between the world coordinate system, which is the coordinate system of the position information, and the SLAM coordinate system.
  • the control unit 11 then converts the coordinates in the SLAM coordinate system associated with the feature points imaged in each frame into the values of the coordinates in the world coordinate system, using this conversion relationship.
  • the world coordinate system is represented by a three-dimensional coordinate system (x, y, z), whose (x, z) plane corresponds to the Universal Transverse Mercator (UTM) coordinate system, an orthogonal plane coordinate system in meters, and whose y-axis corresponds to the altitude above the ground plane (also in meters). As is widely known, values in the UTM orthogonal coordinate system can be converted into latitude and longitude coordinate values, as in the sketch below.
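  • A minimal sketch of this conversion, assuming the third-party pyproj package and (as an illustrative assumption) UTM zone 30N for the area of interest; the mapping of (x, z) to (easting, northing) and the zone choice are not specified in the original text.
```python
# Hedged sketch: convert between the (x, z) plane of the world coordinate system
# (interpreted as UTM easting/northing in meters) and latitude/longitude.
# Assumes pyproj and UTM zone 30N (EPSG:32630), chosen here only for illustration.
from pyproj import Transformer

utm_to_lonlat = Transformer.from_crs("EPSG:32630", "EPSG:4326", always_xy=True)
lonlat_to_utm = Transformer.from_crs("EPSG:4326", "EPSG:32630", always_xy=True)

def world_xz_to_latlon(x, z):
    """Interpret (x, z) as UTM (easting, northing) in meters and return (lat, lon)."""
    lon, lat = utm_to_lonlat.transform(x, z)
    return lat, lon

def latlon_to_world_xz(lat, lon):
    """Return UTM (easting, northing) in meters to use as the (x, z) world coordinates."""
    x, z = lonlat_to_utm.transform(lon, lat)
    return x, z

if __name__ == "__main__":
    x, z = latlon_to_world_xz(36.7213, -4.4214)   # a point in Malaga (illustrative)
    print(x, z, world_xz_to_latlon(x, z))
```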
  • in the present embodiment, when acquiring the conversion relationship between the world coordinate system, which is the coordinate system of the position information, and the SLAM coordinate system, the control unit 11 performs a correction that scales the amount of movement of the camera that captured the moving image data while suppressing changes in the estimated position and orientation expressed in SLAM coordinates, and a correction that brings the position and orientation of the camera close to the position information associated with the reference image data; it then acquires the conversion relationship between the world coordinate system and the SLAM coordinate system using the coordinates of the feature points expressed in the corrected SLAM coordinate system and the world-coordinate values of the corresponding reference feature points.
  • the detailed operation of the control unit 11 will be described later.
  • the storage unit 12 is a memory device, a disk device, or the like, and holds a program executed by the control unit 11.
  • the storage unit 12 also operates as a work memory of the control unit 11.
  • the operation unit 13 is a keyboard, a mouse, or the like, receives an instruction operation of the user, and outputs the content of the instruction operation to the control unit 11.
  • the display unit 14 is a display or the like, and displays and outputs information in accordance with an instruction input from the control unit 11.
  • the communication unit 15 is a network interface or the like, and transmits information such as a request via the network in accordance with an instruction input from the control unit 11.
  • the communication unit 15 also outputs the information received via the network to the control unit 11.
  • the communication unit 15 is used, for example, when acquiring a reference image such as a geotag image from a server on the Internet.
  • the interface unit 16 is, for example, a USB interface or the like, and outputs moving image data input from a camera or the like to the control unit 11.
  • functionally, as illustrated in FIG. 2, the control unit 11 includes a receiving unit 21, a SLAM processing unit 22, a reference image data acquisition unit 23, a search unit 24, a conversion relationship acquisition unit 25, and a conversion processing unit 26.
  • the receiving unit 21 receives moving image data including a series of frames obtained by capturing an object on a moving path with the camera while moving the camera.
  • This camera may be a monocular camera (that is, a camera that does not acquire information in the depth direction), and therefore, it is assumed that frames included in captured moving image data do not include depth information.
  • the control unit 11 reconstructs a three-dimensional map by SLAM using the received moving image data and deforms the obtained three-dimensional reconstruction map into the world coordinate system.
  • in this way, position information is associated with all frames of the moving image data, including frames that capture locations that cannot be directly associated with an image, such as a geotagged image, carrying position information in the world coordinate system.
  • in addition, the mapping to the world coordinate system is performed while mitigating the scale drift problem, in which scale errors gradually accumulate during the reconstruction processing.
  • the SLAM processing unit 22 extracts feature points of the subjects captured in each frame included in the moving image data and associates the feature points with coordinates in the reconstructed three-dimensional space (SLAM processing). The SLAM processing unit 22 then issues unique feature point identification information for each feature point and stores, in the storage unit 12, the feature point identification information, information identifying the frames from which the feature point was extracted, and the three-dimensional coordinates of the feature point in association with one another. This processing is described, for example, in ORB-SLAM (R. Mur-Artal et al., “ORB-SLAM: a versatile and accurate monocular SLAM system,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015); since the specific processing is widely known, its description is omitted here.
  • hereinafter, the information in which the feature point identification information generated here is associated with coordinate values in the coordinate system of the virtual three-dimensional space (the SLAM coordinate system) is called the SLAM map.
  • the SLAM processing unit 22 also generates information Cfp-kf, in which the feature points extracted from each key frame (KF) are associated with their three-dimensional coordinate values in the SLAM coordinate system of the SLAM map, and stores this information Cfp-kf in the storage unit 12 (a minimal data-structure sketch follows).
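  • The following is a minimal, hypothetical sketch of how the SLAM map and the Cfp-kf correspondence described above could be held in memory; the class and field names are illustrative assumptions, not taken from the original disclosure.
```python
# Hedged sketch of the data structures described above (names are illustrative).
from dataclasses import dataclass, field

@dataclass
class SlamMapPoint:
    fp_id: int                                        # unique feature point identification information
    xyz_slam: tuple                                   # 3D coordinates in the SLAM coordinate system
    observed_in: list = field(default_factory=list)   # key frame ids the point was extracted from

@dataclass
class SlamMap:
    points: dict = field(default_factory=dict)        # fp_id -> SlamMapPoint

    def add_observation(self, fp_id, xyz_slam, keyframe_id):
        pt = self.points.setdefault(fp_id, SlamMapPoint(fp_id, xyz_slam))
        pt.observed_in.append(keyframe_id)

def build_cfp_kf(slam_map):
    """Cfp-kf (one possible reading): per key frame, the observed feature points
    together with their SLAM-coordinate values."""
    cfp_kf = {}
    for pt in slam_map.points.values():
        for kf in pt.observed_in:
            cfp_kf.setdefault(kf, []).append((pt.fp_id, pt.xyz_slam))
    return cfp_kf
```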
  • the reference image data acquisition unit 23 also selects at least one of the frames included in the received moving image data as a target frame. This selection may be performed manually, or the frames selected as key frames in the three-dimensional reconstruction processing of the SLAM processing unit 22 may be used as target frames as they are.
  • the reference image data acquisition unit 23 acquires reference image data in which at least one subject in common with the subjects captured in the selected target frame is imaged and which is associated with position information, in world coordinates, representing the position at which it was captured in advance.
  • moving image data to be received is captured by a camera mounted on a vehicle or the like moving along a road.
  • the reference image data can be searched, for example, from Google Street View (https://www.google.com/streetview/) of Google Inc. in the United States.
  • Google Street View is a searchable GIS (Geographic Information System) for the streets, and is one of the large geotag image datasets for countries around the world.
  • all geotag images are given as information in which a panoramic image and latitude and longitude information are associated.
  • each panoramic image is cut out in eight horizontal directions, at the same angle of view as the target frame selected from the received moving image data, and the results are used as a group of geotagged images (a minimal sketch of this cropping follows).
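  • A minimal sketch of the cropping step, assuming an equirectangular Street View panorama loaded with OpenCV; rendering a true perspective view at the target frame's angle of view would require a camera model, so the sketch simply cuts the panorama into eight equal horizontal sectors as an approximation (an assumption for illustration, not the patent's exact procedure).
```python
# Hedged sketch: cut an equirectangular panorama into 8 horizontal views.
# Simplified stand-in for rendering perspective crops at the target frame's angle of view.
import cv2
import numpy as np

def cut_panorama_8(panorama, fov_deg=45.0):
    h, w = panorama.shape[:2]
    crop_w = int(w * fov_deg / 360.0)
    views = []
    for k in range(8):
        center = int(w * k / 8.0)
        xs = np.arange(center - crop_w // 2, center + crop_w // 2) % w   # wrap around the seam
        views.append(panorama[:, xs])
    return views

if __name__ == "__main__":
    pano = cv2.imread("panorama.jpg")          # hypothetical input file
    if pano is not None:
        for i, v in enumerate(cut_panorama_8(pano)):
            cv2.imwrite(f"view_{i}.png", v)
```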
  • the reference image data does not have to be acquired from Google Street View.
  • when the route includes an intersection, image data captured separately from the moving image data and associated with the position information (latitude and longitude information) of the intersection can also be used as a geotagged image.
  • the moving route of the moving image data to be accepted may not necessarily be outdoors, and may be, for example, a route moving in a facility such as a store.
  • in that case, image data captured separately (apart from the moving image data) on the moving path, together with coordinate values in a world coordinate system appropriately defined for the facility, is used as the reference image data.
  • the world coordinate system in this case may be an orthogonal coordinate system in which, viewing the facility from above, a specific point in the facility is taken as the origin, the positive X axis points east, the positive Y axis points north, and the (X, Y) values are expressed in meters.
  • the reference image data acquisition unit 23 accepts, for example, input of information specifying the area in which the received moving image data was captured (for example, a city name, a town name, or a specified distance (for example, 400 meters) from a specified point), receives from the Google Street View server the geotagged images associated with latitude and longitude information within the input area, and cuts out each of those geotagged images in eight horizontal directions, at the same angle of view as the target frame selected from the received moving image data, to form a group of geotagged images.
  • the reference image data acquisition unit 23 then selects, as reference image data, the k geotagged images most similar to the selected target frame, in descending order of similarity.
  • for the degree of similarity, a method using a bag-of-words approach with SIFT feature quantities (P. Agarwal, W. Burgard, and L. Spinello, “Metric localization using Google Street View,” IROS, pp. 3111-3118, 2015) or the like may be adopted.
  • alternatively, the reference image data acquisition unit 23 may display and output the selected target frame together with, for example, geotagged images from Google Street View in the vicinity of the target frame, have the user select a geotagged image, and acquire the selected geotagged image as reference image data.
  • the search unit 24 acquires sets of corresponding feature points from the target frame selected by the reference image data acquisition unit 23 and the reference image data acquired by the reference image data acquisition unit 23. Specifically, the search unit 24 detects ORB feature points (E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to SIFT or SURF,” ICCV, pp. 2564-2571, 2011) from the target frame and from the reference image data, and then matches the feature points detected in each.
  • the search unit 24 may remove wrongly matched pairs by using a VLD (Virtual Line Descriptor) or the like.
  • details of the VLD are disclosed in Z. Liu and R. Marlet, “Virtual line descriptor and semi-local matching method for reliable feature correspondence,” BMVC, pp. 16-1, 2012, and are widely known, so a detailed explanation is omitted here; a minimal ORB matching sketch follows.
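  • A minimal sketch of the ORB detection and matching step using OpenCV; since the VLD-based outlier removal mentioned above is not available in OpenCV, a simple Lowe ratio test is used here as a stand-in (an assumption for illustration).
```python
# Hedged sketch: ORB feature detection and matching between a target frame and
# one reference image, with a ratio test standing in for VLD-based filtering.
import cv2

def match_orb(target_img, reference_img, ratio=0.75):
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(target_img, None)
    kp2, des2 = orb.detectAndCompute(reference_img, None)
    if des1 is None or des2 is None:
        return []
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    knn = matcher.knnMatch(des1, des2, k=2)
    good = [m[0] for m in knn if len(m) == 2 and m[0].distance < ratio * m[1].distance]
    # pairs of (target-frame pixel, reference-image pixel) for the surviving matches
    return [(kp1[m.queryIdx].pt, kp2[m.trainIdx].pt) for m in good]
```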
  • furthermore, the search unit 24 may obtain a plurality of reference image data corresponding to one target frame (as described above, for example when k images are selected in descending order of similarity).
  • when the number of matched feature points found does not reach a predetermined threshold (for example, 5), the search unit 24 excludes the target frame from selection (does not set it as a target frame).
  • the conversion relationship acquisition unit 25 estimates a conversion relationship CSLAM-World between the SLAM coordinate system and the world coordinate system. Specifically, the conversion relationship acquisition unit 25 first estimates the pose of the camera that captured the reference image data. That is, for each feature point of the SLAM map for which a corresponding reference feature point was found in the reference image data by the search unit 24, it obtains from the SLAM map the coordinate value of that feature point in the SLAM coordinate system. In addition, the conversion relationship acquisition unit 25 receives from the search unit 24 information on the position of the reference feature point in the reference image data corresponding to each feature point (a two-dimensional position within the reference image data), and obtains information Cmap-geo in which the SLAM-coordinate value of each feature point is associated with the position of the corresponding reference feature point in the reference image data.
  • the conversion relationship acquisition unit 25 obtains the pose information (six degrees of freedom) in the SLAM coordinate system of the camera that captured the reference image data by minimizing the reprojection error obtained when Cmap-geo is reprojected onto the reference image data. This minimization is performed, for example, using the Levenberg-Marquardt method.
  • the conversion relationship acquisition unit 25 then collects pairs of the pose, in the SLAM coordinate system, of the camera that captured each reference image data and the world coordinates associated with that reference image data, and obtains from these pairs the conversion relationship CSLAM-World between the SLAM coordinate system and the world coordinate system by a widely known method (one possible choice is sketched below).
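  • One widely known way to obtain such a relationship from paired positions is the closed-form estimation of a similarity transformation (Umeyama's method); the patent does not name a specific algorithm, so this choice and the function below are assumptions for illustration.
```python
# Hedged sketch: estimate a similarity transform (scale s, rotation R, translation t)
# mapping SLAM-coordinate camera positions to world-coordinate positions.
# Umeyama's method is used here as one possible "widely known method".
import numpy as np

def umeyama_similarity(src, dst):
    """src, dst: (N, 3) arrays of corresponding points. Returns s, R, t with dst ~ s*R@src + t."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                      # cross-covariance of the centered point sets
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:    # keep a proper rotation (det = +1)
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t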
  • the conversion processing unit 26 transforms the SLAM map using the conversion relationship CSLAM-World between the SLAM coordinate system and the world coordinate system.
  • specifically, the conversion processing unit 26 sequentially performs initialization processing, pose graph optimization, and bundle adjustment.
  • the initialization processing applies the following linear transformations in sequence to the entire SLAM map obtained up to that point, using the correspondence CSLAM-World between the SLAM coordinate system of the SLAM map and the world coordinate system, so as to roughly align the SLAM map with the world coordinate system.
  • this initialization processing is performed only at the beginning of the processing and at timings at which a predetermined condition is satisfied.
  • specifically, the initialization processing is performed at the timing at which the search unit 24 first obtains a pair of a target frame and reference image data whose number of matched feature points is equal to or greater than the predetermined threshold, and thereafter, for the i-th such pair (i being an integer with i ≥ 2), whenever the distance between the estimated position and the position information of the reference image data exceeds a predetermined threshold (for example, 10 m).
  • the conversion processing unit 26 does not perform pose graph optimization and bundle adjustment processing until the first initialization is performed.
  • as the first linear transformation of the initialization processing, the conversion processing unit 26 assumes that the camera capturing the moving image data moves on a single plane and rotates the SLAM map so that this plane coincides with the xz plane.
  • the plane on which the camera lies is estimated by principal component analysis of all the camera positions determined so far (a minimal sketch follows).
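  • A minimal sketch of this plane estimation by principal component analysis of the camera positions (numpy only). Taking the smallest-variance direction as the plane normal and rotating it onto the y-axis follows standard practice and is an assumption about the implementation.
```python
# Hedged sketch: estimate the plane of camera motion by PCA of camera positions,
# then build a rotation that maps the plane normal onto the y-axis.
import numpy as np

def plane_normal_from_positions(cam_positions):
    P = np.asarray(cam_positions, float)
    X = P - P.mean(axis=0)
    w, V = np.linalg.eigh(X.T @ X)       # ascending eigenvalues
    n = V[:, 0]                          # smallest-variance direction = plane normal
    return n / np.linalg.norm(n)

def rotation_normal_to_y(n):
    """Rotation matrix sending unit vector n to the y-axis (0, 1, 0)."""
    y = np.array([0.0, 1.0, 0.0])
    v = np.cross(n, y)
    c = float(np.dot(n, y))
    if np.linalg.norm(v) < 1e-12:                     # already (anti)parallel
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + vx + vx @ vx * (1.0 / (1.0 + c))
```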
  • as the second linear transformation, a similarity transformation is performed that brings each point p in the SLAM coordinate system involved in the first to i-th CSLAM-World closer to the corresponding point pworld in the world coordinate system (Equation (1)).
  • by this transformation, the poses of the cameras that captured the target frames and the reference image data, and the positions of the feature points in the SLAM map, are transformed.
  • the first and second linear transformations here are both three-dimensional similarity transformations, and they do not improve the scale drift.
  • the conversion processing unit 26 performs scale drift improvement processing by pose graph optimization each time position estimation is newly performed using the i-th reference image data.
  • the pose graph used here includes, as illustrated in FIG. 3, nodes Sn representing the pose information of the camera that captured each frame, nodes Sm representing the pose information of the camera that captured each reference image data, and nodes fp representing the position information associated with the reference image data.
  • adjacent nodes Sn (temporally adjacent in the moving image data), representing the pose information of the camera that captured the frames, are connected to each other by a first constraint e1 relating to pose. That is, the relative transformation between the camera poses of adjacent frames is constrained by e1.
  • the node Sn corresponding to a target frame and the node Sm representing the pose information of the camera that captured the reference image data containing feature points corresponding to the feature points included in that target frame are connected to each other by a second constraint e2. That is, the relative transformation between the camera poses of the target frame and the corresponding reference image data is constrained by e2.
  • furthermore, the node Sm representing the pose information of the camera that captured the reference image data and the node fp representing the position information associated with that reference image data are connected to each other by a third constraint e3 relating to distance.
  • in optimizing this pose graph, the conversion processing unit 26 corrects the information on the movement path of the camera that captured the frames while suppressing changes in the camera pose information at each node and changes in the moving direction of the camera, but allowing changes in scale. At this time, the conversion processing unit 26 also performs a correction that brings the positions of the reference image data associated with the target frames, which are a subset of the frames, close to the position information in the world coordinate system originally associated with that reference image data.
  • in other words, the scale drift is improved by constrained pose graph optimization in the three-dimensional similarity transformation group Sim(3).
  • in pose graph optimization, the camera poses are taken as the optimization variables, and the optimization is performed under constraints on the relative transformations between camera poses. That is, in the present embodiment, nonlinear deformation that takes scale drift into account is performed using the pose graph.
  • the three-dimensional rigid transformation G belonging to the special Euclidean group SE(3) is defined by Equation (2); in the standard form it is $G = \begin{pmatrix} R & t \\ \mathbf{0}^{\top} & 1 \end{pmatrix}$, where R is a rotation matrix, t is a three-dimensional translation vector with real components (a vector quantity printed in bold in the original, written as t here for convenience), and s is a nonnegative real scale value.
  • the conversion from SE(3) to Sim(3) is performed by setting the scale component s to 1 without changing the rotation matrix R or the translation vector t. That is, the three-dimensional similarity transformation S (with S belonging to Sim(3)) becomes $S = \begin{pmatrix} sR & t \\ \mathbf{0}^{\top} & 1 \end{pmatrix}$.
  • SO(3), SE(3), and Sim(3) are all Lie groups; each is mapped to the corresponding Lie algebra by the exponential map, and the inverse mapping, the logarithmic map, is also defined.
  • an element of a Lie algebra is written here in vector notation of its coefficients.
  • the Lie algebra corresponding to Sim(3) is written as a seven-dimensional (seven-degree-of-freedom) vector, obtained by adding a scale component to the six-dimensional (six-degree-of-freedom) vector representing the camera pose information, and its exponential map is defined accordingly (a hedged reconstruction is given below).
  • the superscript T on a vector or matrix denotes transposition (the same applies hereinafter).
  • W is a term analogous to the one appearing in Rodrigues' formula.
  • in the present embodiment, a cost function associated with the constrained deformation of the pose graph is defined and minimized by the Levenberg-Marquardt method on Lie groups; while maintaining the structure of the original SLAM map, the processing that improves the scale drift and the processing that brings corresponding points of the SLAM coordinate system and the world coordinate system close to each other are thereby performed at one time.
  • the corresponding logarithmic map yields a seven-dimensional real vector.
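  • The exponential and logarithmic maps referred to above are not reproduced in this text; the following LaTeX block is a hedged reconstruction of the standard Sim(3) form (following common usage, e.g. Strasdat et al. 2010), given for readability rather than as the patent's own equations.
```latex
% Hedged reconstruction of the standard Sim(3) exponential map (an assumption,
% following common usage; the patent's exact notation may differ).
\[
\xi = (\nu^{\top}, \omega^{\top}, \sigma)^{\top} \in \mathbb{R}^{7},
\qquad
\exp(\xi) =
\begin{pmatrix}
 e^{\sigma} \exp(\hat{\omega}) & W\nu \\
 \mathbf{0}^{\top} & 1
\end{pmatrix} \in \mathrm{Sim}(3),
\]
% where $\exp(\hat{\omega})$ is the rotation obtained from $\omega$ by Rodrigues' formula,
% $W$ is the matrix analogous to the one in Rodrigues' formula (depending on $\omega$ and
% $\sigma$), and the logarithmic map returns the seven-dimensional real vector $\xi = \log(S)$.
```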
  • the third constraint e3, relating to the distance between the node Sm representing the pose information of the camera that captured the reference image data and the node fp representing the position information ym associated with that reference image data, is defined using these quantities.
  • minimizing e1i,j and e2k,l acts to suppress changes in the relative transformations between camera poses, apart from gradual changes in scale. Minimizing e3m acts to bring the camera position of the reference image data close to the position information in the world coordinate system associated with that reference image data.
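  • The concrete residual expressions are not reproduced in this text; the following is a hedged reconstruction consistent with the description above and with common Sim(3) pose-graph formulations, and the patent's exact forms may differ.
```latex
% Hedged reconstruction of plausible residuals for the three edge types
% (consistent with the surrounding description; not the patent's verbatim equations).
\[
e_{1\,i,j} = \log_{\mathrm{Sim}(3)}\!\left( \Delta S_{i,j}\, S_j\, S_i^{-1} \right),
\qquad
e_{2\,k,l} = \log_{\mathrm{Sim}(3)}\!\left( \Delta S_{k,l}\, S_l\, S_k^{-1} \right),
\qquad
e_{3\,m} = \mathrm{trans}\!\left( S_m^{-1} \right) - y_m ,
\]
% where $\Delta S_{i,j}$ is the relative transformation measured before optimization,
% $\mathrm{trans}(\cdot)$ extracts the camera position in world coordinates, and
% $y_m$ is the geotag position associated with the $m$-th reference image data.
```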
  • to summarize, the pose graph used by the position estimation device 1 is as follows.
  • Node Sn: pose of the camera when capturing the n-th key frame; Sn ∈ Sim(3), n ∈ {1, 2, ..., N}.
  • Node Sm: pose of the camera when capturing the m-th reference image data; Sm ∈ Sim(3), m ∈ {1, 2, ..., M}.
  • Edge e1i,j: constraint due to the relative transformation between the camera poses when capturing the i-th and j-th key frames; (i, j) ∈ C1.
  • Edge e2k,l: constraint due to the relative transformation between the camera poses for the pair (k, l) ∈ C2 of a target frame and its corresponding reference image data.
  • Here, N is the total number of key frames, M is the number of target frames (the total number of reference image data associated with the target frames), C1 is the set of key-frame pairs in which the same feature points of the SLAM map are captured, and C2 is the set of pairs of a target frame and the corresponding reference image data.
  • the conversion processing unit 26 executes processing that optimizes the pose graph defined as above. That is, the conversion processing unit 26 extracts, from the frames of the received moving image data, at least one key frame group C1 (containing N key frames) in which common feature points are imaged. Referring also to the set C2 (M pairs of reference image data) of target frames and the corresponding feature points found by the search unit 24, it estimates the camera pose information S1, S2, ... by minimizing the cost function on the Lie manifold according to the Levenberg-Marquardt method.
  • the conversion processing unit 26 also reflects the transformation obtained by this optimization in the positions of the feature points of the SLAM map. This reflection can be carried out using widely known methods, such as the one used in H. Strasdat et al., “Scale drift-aware large scale monocular SLAM,” Robotics: Science and Systems VI, 2010.
  • that is, compared with the pose graph used by Strasdat et al., the nodes Sm representing the poses of the cameras that captured the reference image data, the edges e2k,l representing constraints due to the relative transformation between the camera poses of the target frames and the corresponding reference image data, and the edges e3m representing the distance between the pose Sm of the camera that captured the reference image data and the position information ym in the world coordinate system associated with it are added; the scale drift of the SLAM map is thereby improved by bringing the camera positions of the reference image data, such as geotagged images, close not to a loop closure but to the values in the world coordinate system represented by the position information (geotags).
  • when the frames of the received moving image data contain another key frame group C1 in which common feature points are imaged, other than the key frame group C1 already subjected to the pose graph optimization, the conversion processing unit 26 sequentially performs the pose graph optimization on that key frame group as well (a simplified sketch of the optimization follows).
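  • The following is a deliberately simplified, hedged sketch of the constrained optimization described above, using scipy. It optimizes only per-key-frame positions and per-key-frame log-scales (rather than full Sim(3) poses), which is enough to illustrate how relative-motion constraints, scale variables, and geotag anchors can be combined in a single least-squares problem; all names and weights are illustrative assumptions, not the patent's implementation.
```python
# Hedged, simplified sketch of constrained pose-graph optimization.
# Variables: key-frame positions p_n (3D) and per-key-frame log-scales s_n.
# Residuals: (i) keep the scaled relative motion close to the SLAM estimate,
# (ii) let the scale change only gradually, (iii) pull key frames that have a
# geotagged reference image toward the geotag position in world coordinates.
# (The patent optimizes full Sim(3) poses with Levenberg-Marquardt; scipy's
# default trust-region solver is used here for robustness of the sketch.)
import numpy as np
from scipy.optimize import least_squares

def optimize_path(p_slam, geo_anchors, w_rel=1.0, w_scale=10.0, w_geo=0.1):
    """p_slam: (N, 3) SLAM key-frame positions; geo_anchors: {frame index: (3,) world position}."""
    p_slam = np.asarray(p_slam, float)
    N = len(p_slam)
    rel = np.diff(p_slam, axis=0)                            # SLAM relative motions

    def residuals(x):
        p = x[:3 * N].reshape(N, 3)
        s = x[3 * N:]
        r = []
        for n in range(N - 1):
            scale = np.exp(0.5 * (s[n] + s[n + 1]))
            r.append(w_rel * ((p[n + 1] - p[n]) - scale * rel[n]))   # (i)
            r.append([w_scale * (s[n + 1] - s[n])])                  # (ii)
        for n, y in geo_anchors.items():
            r.append(w_geo * (p[n] - np.asarray(y, float)))          # (iii)
        return np.concatenate([np.ravel(v) for v in r])

    x0 = np.concatenate([p_slam.ravel(), np.zeros(N)])
    sol = least_squares(residuals, x0)
    return sol.x[:3 * N].reshape(N, 3), np.exp(sol.x[3 * N:])
```
  • With two or more geotag anchors along the trajectory, such an optimizer stretches or shrinks the path toward them while the relative-motion and scale-smoothness terms preserve its local shape, which is the behaviour the e1, e2, and e3 constraints above are intended to produce.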
  • the conversion processing unit 26 further deforms the SLAM map by bundle adjustment (BA) that includes constraints determined in relation to the reference image data.
  • specifically, the conversion processing unit 26 of the present embodiment performs the following bundle adjustment, minimizing the Cfp-kf reprojection error and the Cfp-geo reprojection error together.
  • in this bundle adjustment, the conversion processing unit 26 obtains the reprojection error ri,j between the i-th feature point and the j-th camera pose from the following quantities.
  • Xi is the coordinate of the i-th feature point in the SLAM coordinate system,
  • xi is the two-dimensional coordinate of that feature point within the frame,
  • Rj and tj represent the rotation and translation of the j-th camera pose,
  • (fx, fy) is the focal length, and
  • (cx, cy) represents the coordinates of the projection center.
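  • The equation itself is not reproduced in this text; the following is a hedged sketch of the standard pinhole reprojection error built from the quantities just listed (it may differ in detail from the patent's own equation).
```python
# Hedged sketch: standard pinhole reprojection error r_ij from the quantities above.
import numpy as np

def reprojection_error(X_i, x_i, R_j, t_j, fx, fy, cx, cy):
    """X_i: 3D point, x_i: observed 2D point, (R_j, t_j): camera pose; returns the 2D residual r_ij."""
    Xc = R_j @ np.asarray(X_i, float) + np.asarray(t_j, float)   # point in camera coordinates
    u = fx * Xc[0] / Xc[2] + cx                                  # projected pixel coordinates
    v = fy * Xc[1] / Xc[2] + cy
    return np.asarray(x_i, float) - np.array([u, v])
```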
  • the total cost function is defined as follows, including the constraints relating to the reference image data.
  • Tj is the pose information of the camera that captured the j-th key frame, expressed as an element of SE(3).
  • the robust cost applied to each term is a Huber robust cost function.
  • C5 represents the feature points in the key frames of C1.
  • C1 and C3 are the sets already mentioned.
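  • The total cost function is likewise given in the original as an equation not reproduced here; the following is a hedged reconstruction consistent with the surrounding description (Huber-robustified reprojection terms over the key-frame correspondences plus reprojection terms over the reference-image correspondences), and the exact index sets and weights in the patent may differ.
```latex
% Hedged reconstruction of the total bundle-adjustment cost (not the patent's verbatim equation).
\[
E = \sum_{(i,j)} \rho_{h}\!\left( \left\lVert r_{i,j} \right\rVert^{2} \right)
  + \sum_{(i,m)} \rho_{h}\!\left( \left\lVert \tilde{r}_{i,m} \right\rVert^{2} \right),
\]
% where the first sum runs over the feature-point / key-frame correspondences (Cfp-kf),
% the second over the feature-point / reference-image correspondences (Cfp-geo),
% $\tilde{r}_{i,m}$ is the reprojection error of feature point $i$ onto reference image $m$,
% and $\rho_{h}$ is the Huber robust cost function.
```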
  • the feature points of the key frame and the information on the camera pose can be obtained by minimizing the total cost function on this Lie manifold. This minimization can be done using the Levenberg-Marquardt method.
  • that is, in order to incorporate the constraints from the position information associated with the reference image data, not only the usual Cfp-kf reprojection error but also the Cfp-geo reprojection error is minimized.
  • in this minimization, the pose information of the cameras that captured the reference image data is kept fixed. This deformation can further reduce the scale drift of the SLAM map, particularly when a sufficiently good initial solution is provided.
  • the position estimation device 1 of the present embodiment has the above configuration and operates as follows.
  • the position estimation device 1 receives moving image data including a series of frames obtained by imaging subjects on a moving route while moving, and executes the processing illustrated in FIG. 4 below.
  • the position estimation device 1 of the present embodiment extracts feature points of the subjects captured in each frame included in the moving image data and generates the SLAM map, a reconstruction map in which the feature points are associated with coordinates in the reconstruction space, a predetermined virtual three-dimensional space (S1: SLAM processing).
  • the position estimation device 1 selects at least one of the frames included in the received moving image data as a target frame (S2), and repeats the following processing for each target frame.
  • the position estimation device 1 takes the selected target frames one by one, in a predetermined order, as the frame to be processed.
  • that is, the position estimation device 1 acquires reference image data in which at least one subject imaged in the target frame is imaged and which is associated with position information, in a predetermined world coordinate system, representing the position at which it was captured in advance (S3).
  • this acquisition processing may be performed by instructing a server on the network to perform a search, or by asking the user to input the reference image data.
  • the position estimation device 1 searches for the reference feature point corresponding to the feature point found from the target frame in the acquired reference image data (feature point matching: S4).
  • the position estimation device 1 checks whether the number of reference feature points found here is equal to or greater than a predetermined threshold (for example, five) (S5).
  • if the number of reference feature points retrieved in step S5 is equal to or greater than the predetermined threshold (S5: Yes), the position estimation device 1 estimates the position and attitude of the camera that captured the reference image data acquired in step S3 (S6).
  • this processing can be performed using the coordinates (two-dimensional) of the feature points in the target frame, the coordinates (three-dimensional) of those feature points in the SLAM map coordinate system, the coordinates (two-dimensional) of the corresponding reference feature points in the reference image data, and the position information (three-dimensional coordinate values in the world coordinate system) associated with the reference image data (a minimal sketch follows).
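  • A minimal sketch of estimating the reference camera's pose from such 3D-2D correspondences using OpenCV's PnP solver; the patent describes a Levenberg-Marquardt reprojection-error minimization, so solvePnPRansac is used here only as a widely available stand-in, and the intrinsics are illustrative parameters.
```python
# Hedged sketch: estimate the pose of the camera that captured the reference image
# from 3D SLAM-map points and their 2D reference feature points (PnP).
import numpy as np
import cv2

def estimate_reference_camera_pose(points_3d_slam, points_2d_ref, fx, fy, cx, cy):
    """points_3d_slam: (N, 3) SLAM-map coordinates; points_2d_ref: (N, 2) pixels in the reference image."""
    K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]], dtype=np.float64)
    obj = np.asarray(points_3d_slam, dtype=np.float64).reshape(-1, 1, 3)
    img = np.asarray(points_2d_ref, dtype=np.float64).reshape(-1, 1, 2)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(obj, img, K, None)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)           # rotation of the reference camera in SLAM coordinates
    return R, tvec.reshape(3)
```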
  • the position estimation device 1 then returns to step S3, sets the next target frame as the frame to be processed, and continues the processing; if there is no target frame left unprocessed (if the processing for all target frames is completed), it proceeds to the next processing (processing S10 described below).
  • if the number of retrieved reference feature points is below the threshold (S5: No), the process returns to step S3 without executing step S6, and the next target frame is set as the frame to be processed and the processing continues. Here too, if there is no target frame left unprocessed (if the processing for all target frames is completed), the process proceeds to the next processing (processing S10 described below).
  • by the above processing, the position estimation device 1 can obtain a conversion relationship between the world coordinate system, which is the coordinate system of the position information associated with each reference image data, and the coordinate system of the coordinate values of the feature points found in each target frame, that is, the coordinate system of the SLAM map.
  • the position estimation device 1 performs transformation processing of the SLAM map.
  • the position estimation device 1 first determines whether to perform an initialization process (S10).
  • in step S10, it may be determined that the initialization processing is to be performed when the distance between the position previously estimated as the position of the camera (this estimation will be described later) and the position information associated with the corresponding reference image data exceeds a predetermined threshold (for example, 10 m).
  • when performing the initialization, the position estimation device 1 executes the initialization processing (S11): first, assuming that the camera capturing the moving image data moves on a single plane, the position estimation device 1 performs principal component analysis of the positions of the camera (the camera that captured the received moving image data) in the SLAM map obtained so far, determines the coordinate axes of the plane on which the camera is assumed to lie (for example, the axis of the normal to that plane), and rotates the SLAM map so that this normal becomes parallel to the y-axis (so that the plane on which the camera lies becomes the xz plane).
  • the position estimation device 1 then scales the SLAM coordinate system (a similarity transformation), using the frames and reference image data obtained so far and the position information (in the world coordinate system) associated with the reference image data, so that the sum (or the sum of squares) of the absolute differences between the coordinate value of each feature point in the SLAM coordinate system and the value of the corresponding reference feature point in the world coordinate system becomes minimal.
  • Next, the position estimation device 1 optimizes the constrained pose graph in Sim(3) (S12), defined as follows.
  • Node Sn: pose of the camera when capturing the n-th key frame; Sn ∈ Sim(3), n ∈ {1, 2, ..., N}.
  • Node Sm: pose of the camera when capturing the m-th reference image data; Sm ∈ Sim(3), m ∈ {1, 2, ..., M}.
  • Edge e1i,j: constraint due to the relative transformation between the camera poses when capturing the i-th and j-th key frames; (i, j) ∈ C1.
  • Edge e2k,l: constraint due to the relative transformation between the camera poses for the pair (k, l) ∈ C2 of a target frame and its corresponding reference image data.
  • Edge e3m: constraint on the distance between the pose Sm of the camera that captured the reference image data and the position information ym in the world coordinate system associated with it; m ∈ {1, 2, ..., M}.
  • that is, an optimization is performed using a cost function that includes a cost value relating to changes in the estimated position and orientation, expressed in SLAM coordinates of the reconstruction space (the SLAM map), of the camera that captured the moving image data, a cost value relating to the scaling of the movement amount of the camera, and a cost value relating to the distance between the position of the camera and the position information associated with the reference image data; the correction that scales the movement amount of the camera while suppressing changes in the estimated position and orientation expressed in SLAM coordinates, and the correction that brings the position and orientation of the camera close to the position information associated with the corresponding reference image data, are thereby performed at one time.
  • the position estimation device 1 also performs bundle adjustment processing to correct the reprojection error of the feature points onto the key frames and the reprojection error of the reference feature points onto the reference image data (S13). As a result, an estimate of the position and orientation, expressed in the world coordinate system, of the camera that captured the moving image data is obtained for each target frame.
  • the position estimation device 1 outputs the information (result of estimation) of the position and orientation of the camera in the world coordinate system obtained in the processing up to this point (S14), and ends the processing.
  • the position estimation device 1 has been described as an example of receiving moving image data whose imaging has already been completed, but the present embodiment is not limited to this.
  • moving image data including a series of frames obtained by imaging subjects on a moving path while moving may instead be accepted sequentially, each time a part of the frames (for example, one frame) is captured.
  • in that case, the SLAM map is generated by sequential processing (that is, feature points of the subjects captured in the sequentially received frames are extracted, and those feature points are associated with coordinates in the reconstruction space, the predetermined virtual three-dimensional space).
  • Such SLAM map generation processing is widely known as incremental SfM (Structure from Motion) or the like, and thus detailed description thereof is omitted here.
  • in this case, the position estimation device 1 also performs the subsequent processing sequentially. That is, the position estimation device 1 determines whether the received frame is to be used as a key frame and, when using it as a key frame, selects that key frame as a target frame and performs the processing from step S3 onward in FIG. 4.
  • when the difference between the position of the camera estimated for a certain frame in steps S12 and S13 of FIG. 4 and the position represented by the position information associated with the reference image data corresponding to that frame exceeds a predetermined threshold, the process may return to step S10 and perform the initialization processing again.
  • in the following example, a position estimation method different from the embodiment of the present invention, using reference image data (here, geotagged images), and the position estimation method according to the embodiment of the present invention are compared, in order to quantitatively evaluate the accuracy of the method of the embodiment in which the position estimation device 1 according to the embodiment of the present invention described above is implemented by a computer.
  • for the three-stage deformation processing consisting of (1) the linear transformation of the initialization processing (ILT), (2) the constrained pose graph optimization on Sim(3) (PGO), and (3) the bundle adjustment (BA) that corrects the reprojection error onto the key frames and the reprojection error onto the reference image data, the influence of each process on accuracy and its usefulness are also examined.
  • for the evaluation, the Málaga Stereo and Laser Urban Data Set (Málaga data set) is used.
  • the video of this Málaga data set has a resolution of 1024 × 768 and a frame rate of 20 fps.
  • two types of video (Video 1 and 2) are cut out from the video and used for evaluation.
  • the paths covered by the two videos do not form loops, and each path length is 1 km or more.
  • although all the frames are associated with GPS position information acquired once per second, frames containing errors of 10 m or more are also included.
  • for the comparison methods, SLAM is used as in the present embodiment, the portion that acquires the correspondence between the world coordinate system and the SLAM coordinate system (the reconstructed three-dimensional coordinate system) is replaced with the same processing as in the present embodiment, and only the processing that deforms the SLAM map is compared.
  • the Kroeger method uses spline smoothing to smooth the camera pose.
  • in that method, the camera pose of the geotagged image (reference image data) corresponding to each key frame is smoothed (interpolated) using a cubic B-spline.
  • Chang's method performs a two-dimensional affine transformation on the xz plane so that the SLAM map restored from the input moving image data matches the 3D point cloud acquired from Google Street View, and performs a scale conversion in the y-axis direction.
  • here, a method that applies the same transformation as Chang's method (Affine+) to the SLAM map was used.
  • the method of the present embodiment is closer to the ground truth (GT) than the other methods, and achieves a significant improvement in accuracy compared with them.
  • because the Kroeger method interpolates the correspondence between sparse geotags and images without considering the three-dimensional structure, large errors occur in the position estimates where there are not enough corresponding points.
  • Chang's method uses the three-dimensional structure of the video but applies only a simple linear transformation, so it is strongly affected by distortion due to scale drift, and in some cases the position cannot be estimated with sufficient accuracy.
  • in the present embodiment, using the result of three-dimensional reconstruction (SLAM), the absolute position (the position in the world coordinate system, that is, information corresponding to latitude and longitude) of the camera at the time each frame of the moving image data was captured was estimated from the sparse correspondence between geotagged images and the captured moving image data.
  • that is, by integrating processing that improves scale drift using the sparse correspondence (association of only some of the frames) between geotagged images and the captured moving image data, estimation of the absolute position has become possible while properly using the three-dimensionally reconstructed structure.
  • the accuracy is considered to be improved as the error included in the position information such as the latitude and longitude associated with the reference image data is smaller.
  • if the feature points in the frames of the captured moving image data and the reference feature points in the corresponding reference image data lie far from the positions at which the images were captured, or if only a small number of them exist, the matching accuracy between feature points will not be sufficient.
  • furthermore, when the camera points in a direction parallel to the roadway (the movement trajectory), or when the angle of view of the camera is narrow, the error in the direction from the camera toward the feature point group becomes relatively large when estimating the position and orientation of the camera that captured the reference image data. This error mainly occurs in the direction parallel to the roadway.
  • Reference Signs List: 1 position estimation device, 11 control unit, 12 storage unit, 13 operation unit, 14 display unit, 15 communication unit, 16 interface unit, 21 receiving unit, 22 SLAM processing unit, 23 reference image data acquisition unit, 24 search unit, 25 conversion relationship acquisition unit, 26 conversion processing unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

This position estimating device accepts moving image data including a series of frames obtained by capturing images of a subject on a movement pathway while moving, extracts feature points relating to the subject captured in each frame included in the moving image data, and creates a reconfiguration map in which the feature points are associated with coordinates in a reconfiguration space, which is a prescribed virtual three-dimensional space. The position estimating device retrieves reference feature points corresponding to the feature points in reference image data associated with position information, acquires a transformation relationship between a world coordinate system, which is the coordinate system of the position information, and the coordinate system of the coordinates associated with the feature points, and uses the transformation relationship to correct the reconfiguration map by performing a correction that scales the amount of movement of the camera that captured the moving image data while suppressing changes in the estimated position and attitude of the camera expressed in the coordinates of the reconfiguration space, and a correction that brings the position and attitude of the camera close to the position information associated with the reference image data.

Description

Position estimation device, position estimation method, and program
The present invention relates to a position estimation device, a position estimation method, and a program.
In recent years, with the development of automatic driving technology for automobiles, infrastructure for collecting moving image data (traveling videos) captured while traveling on roads in various countries is being developed, based on various devices equipped with cameras.
At present, such a large number of traveling videos can only be viewed as simple videos. However, if meaningful information is extracted from these traveling videos and the images contained in each traveling video can be associated with the accurate positions on a map at which they were captured, an advanced geographic information system (GIS) can be constructed, which is expected to contribute to the development of automatic driving technology and robotics.
For this reason, performing traveling position estimation on traveling videos in a way that is applicable to a large number of traveling videos has been desired in recent years.
The method of estimating the absolute position of a video using geotagged images has the advantage that relatively high-accuracy position estimation is possible, compared with methods using road networks or satellite imagery, provided that sufficient correspondences between images can be obtained. On the other hand, associating the input traveling video with geotagged images is easily affected by changes in the illumination environment and in the captured angle of view, and can be difficult. If this association is not appropriate, position estimation either fails or cannot reach the required accuracy.
The present invention has been made in view of such circumstances, and one of its objects is to provide a position estimation device, a position estimation method, and a program that can set corresponding position information for each frame image of a traveling video, including frames captured at places that cannot be directly associated with a geotagged image.
Non-Patent Document 1 discloses estimating (reconstructing) three-dimensional information of objects captured in a traveling video by SLAM (Simultaneous Localization and Mapping) and performing position estimation of the traveling video by deforming the resulting three-dimensional reconstruction map as an object in the world coordinate system. However, it is known that with the method of Non-Patent Document 1 the error accumulates as the traveling distance becomes longer, and the position estimation accuracy decreases.
The present invention, which solves the problems of the prior art described above, is a position estimation device comprising: accepting means for accepting moving image data including a series of frames obtained by imaging subjects on a movement path while moving; acquisition means for taking at least one of the frames included in the moving image data as a target frame and acquiring reference image data in which at least one subject imaged in the target frame is also imaged and which is associated with position information, in a predetermined world coordinate system, representing the position at which it was captured in advance; reconstruction means for extracting feature points of the subjects imaged in each frame included in the moving image data and generating a reconstruction map in which the feature points are associated with coordinates in a reconstruction space, which is a virtual three-dimensional space; search means for searching the reference image data for reference feature points corresponding to the feature points; relationship acquisition means for acquiring, based on the found reference feature points, the position information associated with the reference image data, and the coordinates associated with the feature points, a conversion relationship between the world coordinate system, which is the coordinate system of the position information, and the coordinate system of the coordinates associated with the feature points; and conversion means for correcting the reconstruction map using the conversion relationship, wherein the conversion means corrects the reconstruction map by performing a correction that scales the amount of movement of the camera while suppressing changes in the estimated position and orientation, expressed in the coordinates of the reconstruction space, of the camera that captured the moving image data, and a correction that brings the position and orientation of the camera close to the position information associated with the reference image data.
According to the present invention, corresponding position information can be set even for frame images of a traveling video that include places that cannot be directly associated with a geotagged image.
FIG. 1 is a block diagram showing a configuration example of a position estimation device according to an embodiment of the present invention.
FIG. 2 is a functional block diagram showing an example of the position estimation device according to the embodiment of the present invention.
FIG. 3 is an explanatory diagram showing an example of a pose graph used by the position estimation device according to the embodiment of the present invention.
FIG. 4 is a flowchart showing an operation example of the position estimation device according to the embodiment of the present invention.
FIG. 5 is an explanatory diagram showing a comparison between an example of the present invention and a conventional example.
FIG. 6 is an explanatory diagram showing an effect of the example of the present invention.
Embodiments of the present invention will be described with reference to the drawings. The position estimation device 1 according to the embodiment of the present invention includes, as illustrated in FIG. 1, a control unit 11, a storage unit 12, an operation unit 13, a display unit 14, a communication unit 15, and an interface unit 16.
 The control unit 11 is a program-controlled device such as a CPU and operates in accordance with a program stored in the storage unit 12. In the present embodiment, the control unit 11 accepts moving image data including a series of frames obtained by imaging subjects along a movement path while moving. Taking at least one of the frames included in the accepted moving image data as a target frame, the control unit 11 acquires reference image data in which at least one subject captured in the target frame also appears and which is associated with position information, in world coordinates, representing the position at which it was captured in advance.
 The control unit 11 also extracts feature points of the subjects captured in each frame of the accepted moving image data and performs processing as SLAM (Simultaneous Localization and Mapping), which associates the feature points with coordinates in a reconstruction space, a predetermined virtual three-dimensional space, and searches the reference image data for reference feature points corresponding to the feature points.
 Based on the found reference feature points, the position information associated with the reference image data, and the coordinates associated with the feature points (the coordinate system of these feature points, that is, the coordinate system of the reconstruction space, is hereinafter called the SLAM coordinate system), the control unit 11 acquires a transformation relation between the world coordinate system, which is the coordinate system of the position information, and the SLAM coordinate system.
 Using this transformation relation, the control unit 11 then converts the coordinates, in the SLAM coordinate system, associated with the feature points captured in each frame into coordinate values in the world coordinate system. Here the world coordinate system is a three-dimensional coordinate system (x, y, z) whose (x, z) plane corresponds to the Universal Transverse Mercator (UTM) rectangular coordinate system, a planar rectangular coordinate system in meters, and whose y axis corresponds to the altitude above the ground plane (also in meters). As is widely known, values in the UTM rectangular coordinate system can be converted into latitude-longitude coordinate values.
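 A minimal sketch of this world-coordinate convention is given below, assuming the pyproj library and UTM zone 54N (EPSG:32654, which covers the Tokyo area); the zone choice and function names are illustrative assumptions, not part of the embodiment.

```python
# World coordinates: (x, z) are UTM easting/northing in metres, y is altitude in metres.
from pyproj import Transformer

# UTM zone 54N (metres) -> WGS84 latitude/longitude
utm_to_latlon = Transformer.from_crs("EPSG:32654", "EPSG:4326", always_xy=True)

def world_to_latlon(x: float, y: float, z: float) -> tuple[float, float, float]:
    """Convert a world-coordinate point into (latitude, longitude, altitude)."""
    lon, lat = utm_to_latlon.transform(x, z)
    return lat, lon, y
```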
 In the present embodiment, when acquiring the transformation relation between the world coordinate system, which is the coordinate system of the position information, and the SLAM coordinate system, the control unit 11 performs a correction that scales the amount of camera movement while suppressing changes in the estimated position and orientation, expressed in SLAM coordinates, of the camera that captured the moving image data, and a correction that brings the position and orientation of the camera close to the position information associated with the reference image data; it then acquires the transformation relation between the world coordinate system and the SLAM coordinate system using the corrected coordinates of the feature points in the SLAM coordinate system and the world-coordinate values of the corresponding reference feature points. The detailed operation of the control unit 11 will be described later.
 The storage unit 12 is a memory device, a disk device, or the like, and holds the program executed by the control unit 11. The storage unit 12 also operates as a work memory for the control unit 11.
 The operation unit 13 is a keyboard, a mouse, or the like; it accepts instruction operations from the user and outputs the contents of those operations to the control unit 11. The display unit 14 is a display or the like and displays information in accordance with instructions input from the control unit 11.
 The communication unit 15 is a network interface or the like; it transmits information such as requests via a network in accordance with instructions input from the control unit 11, and outputs information received via the network to the control unit 11. The communication unit 15 is used, for example, when acquiring reference images such as geotagged images from a server on the Internet. The interface unit 16 is, for example, a USB interface or the like, and outputs moving image data input from a camera or the like to the control unit 11.
 Functionally, as illustrated in FIG. 2, the control unit 11 of the present embodiment includes an accepting unit 21, a SLAM processing unit 22, a reference image data acquisition unit 23, a search unit 24, a transformation relation acquisition unit 25, and a transformation processing unit 26.
 The accepting unit 21 accepts moving image data including a series of frames obtained by imaging subjects along a movement path with a camera while moving the camera. This camera may be a monocular camera (that is, a camera that does not acquire depth information), so the frames included in the captured moving image data are assumed to contain no depth information.
 In the present embodiment, the control unit 11 reconstructs a three-dimensional map by SLAM using the accepted moving image data and deforms the obtained three-dimensional reconstruction map into the world coordinate system. In this way, position information is associated with every frame of the moving image data, including frames that capture places that cannot be directly associated with an image, such as a geotagged image, linked to position information in the world coordinate system.
 In the processing of reconstructing a three-dimensional map from moving image data in the present embodiment, the map is mapped to the world coordinate system while improving scale drift, taking into account the problem that scale errors gradually accumulate during reconstruction (the scale drift problem).
 The SLAM processing unit 22 executes processing (SLAM processing) that extracts feature points of the subjects captured in each frame of the moving image data and associates the feature points with coordinates in the reconstructed three-dimensional space. The SLAM processing unit 22 then issues unique feature point identification information for each feature point and stores, in the storage unit 12, the feature point identification information, information identifying the frame from which the feature point was extracted, and the coordinates of the feature point in the three-dimensional space, in association with one another. This processing can be performed using, for example, ORB-SLAM (R. Mur-Artal, et al., "ORB-SLAM: a versatile and accurate monocular SLAM system", IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147-1163, 2015); since the specific content of this processing is widely known, a description is omitted here.
 Hereinafter, the information generated here, which associates the feature point identification information with coordinate values in the coordinate system of the virtual three-dimensional space (the SLAM coordinate system), is called the SLAM map. In this SLAM processing, key frames (KF) are selected from among all the frames, and information Cfp-kf is obtained that associates, for each feature point in a key frame, the coordinates representing its position within the key frame (two-dimensional coordinates representing pixel positions in the frame image; this coordinate system is hereinafter called the frame coordinate system) with its three-dimensional coordinate values in the SLAM coordinate system of the SLAM map. The SLAM processing unit 22 stores this information Cfp-kf in the storage unit 12.
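 A minimal structural sketch of the data kept for the SLAM map and for the key-frame correspondences Cfp-kf described above is shown below; the class and field names are illustrative assumptions, not the patented implementation.

```python
from dataclasses import dataclass, field

@dataclass
class MapPoint:
    point_id: int                          # unique feature point identification information
    xyz_slam: tuple[float, float, float]   # 3-D position in the SLAM coordinate system

@dataclass
class KeyFrame:
    frame_id: int
    # Cfp-kf: feature point id -> 2-D pixel position in this key frame (frame coordinate system)
    observations: dict[int, tuple[float, float]] = field(default_factory=dict)

@dataclass
class SlamMap:
    points: dict[int, MapPoint] = field(default_factory=dict)
    keyframes: list[KeyFrame] = field(default_factory=list)
```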
 The reference image data acquisition unit 23 also selects at least one of the frames included in the accepted moving image data as a target frame. This selection may be made manually, or the frames selected as key frames by the SLAM processing unit 22 during the three-dimensional reconstruction processing may be used as target frames as they are.
 The reference image data acquisition unit 23 acquires reference image data in which at least one subject in common with the subjects captured in the selected target frame appears and which is associated with position information, in world coordinates, representing the position at which it was captured in advance.
 In one example of the present embodiment, the accepted moving image data is captured by a camera mounted on a vehicle or the like moving along a road. In this case, the reference image data can be retrieved, for example, from Google Street View (https://www.google.com/streetview/) of Google Inc. Google Street View is a searchable GIS (geographic information system) covering roads and is one of the large-scale geotagged image data sets covering countries around the world. In Google Street View, every geotagged image is given as a panoramic image associated with latitude-longitude information. In the present embodiment, each panoramic image is cropped in eight horizontal directions at the same angle of view as the target frames selected from the accepted moving image data, and the results are used as a group of geotagged images.
 However, in the present embodiment, the reference image data does not necessarily have to be acquired from Google Street View. For example, when the route includes an intersection, image data captured at the intersection (image data other than image data extracted from the moving image data) together with the position information (latitude-longitude information) of the intersection can be used as a geotagged image.
 Further, in the present embodiment, the movement path of the accepted moving image data does not necessarily have to be outdoors; it may be, for example, a path through a facility such as a store. In this case, image data captured separately (apart from the moving image data) along the movement path and coordinate values in a world coordinate system set appropriately within the facility are used as the reference image data. As an example, the world-coordinate values in this case may be values of an (X, Y) rectangular coordinate system in meters, taking a specific point in the facility as the origin in a plan view of the facility, with the positive X axis pointing east and the positive Y axis pointing north from the origin.
 Also, as described above, since world coordinates are assigned to the geotagged images of Google Street View, if the camera position of a geotagged image can be estimated in the SLAM coordinate system, the correspondence between the SLAM coordinates and the world coordinates can be acquired.
 For example, upon receiving input of information specifying the area in which the accepted moving image data was captured (for example, a city name, a town name, or a region within a specified distance (for example, 400 meters) from a specified point), the reference image data acquisition unit 23 acquires, from the Google Street View server, the geotagged images associated with latitude-longitude information within the input area, and crops each geotagged image in eight horizontal directions at the same angle of view as the target frames selected from the accepted moving image data to form a group of geotagged images.
 The reference image data acquisition unit 23 selects, as reference image data, the k geotagged images most similar to the selected target frame, in descending order of similarity. Here, the similarity may be computed by a method using a bag-of-words approach with SIFT features (Agarwal, W. Burgard, and L. Spinello, "Metric localization using google street view," IROS, pp. 3111-3118, 2015) or the like.
 In another example, the reference image data acquisition unit 23 may display the selected target frame, have the user select a geotagged image close to the displayed target frame, for example from the geotagged images provided by Google Street View, and acquire the selected geotagged image as the reference image data.
 The search unit 24 acquires sets of corresponding feature points from the target frame selected by the reference image data acquisition unit 23 and the reference image data acquired by the reference image data acquisition unit 23. Specifically, the search unit 24 detects ORB feature points (see E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," ICCV, pp. 2564-2571, 2011) from each of the target frame and the reference image data, and then matches the feature points detected from the two. A widely known method may be adopted for this matching, but it is preferable that the search unit 24 remove incorrect matches, for example by using a VLD (Virtual Line Descriptor). VLD is disclosed in detail in Z. Liu and R. Marlet, "Virtual line descriptor and semi-local matching method for reliable feature correspondence," BMVC, pp. 16-1, 2012, and is widely known, so a detailed description is omitted here.
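 A minimal sketch of this ORB matching step, using OpenCV, is shown below. It is an illustrative stand-in for the search unit; the VLD-based filtering mentioned in the text is replaced here, as an assumption, by a simple ratio test.

```python
import cv2

def match_orb_features(target_frame, reference_image, max_features=2000, ratio=0.75):
    """Return pairs of matched keypoint coordinates ((x, y) in frame, (x, y) in reference)."""
    orb = cv2.ORB_create(nfeatures=max_features)
    kp1, des1 = orb.detectAndCompute(target_frame, None)
    kp2, des2 = orb.detectAndCompute(reference_image, None)
    if des1 is None or des2 is None:
        return []
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    good = []
    for pair in matcher.knnMatch(des1, des2, k=2):
        # Keep a match only if it is clearly better than the second-best candidate.
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    return [(kp1[m.queryIdx].pt, kp2[m.trainIdx].pt) for m in good]
```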
 In the present embodiment, when a plurality of reference image data have been obtained for one target frame (for example, when the top k images are selected in descending order of similarity as described above), the search unit 24 further discards the information relating to any reference image data for which the number of mutually corresponding feature points is less than a predetermined threshold (for example, 5).
 At this time, if none of the reference image data obtained for the target frame has more than the threshold number of feature points corresponding to the target frame (hereinafter, when a distinction is necessary, feature points found in the reference image data are called reference feature points), the search unit 24 excludes that target frame from selection (does not treat it as a target frame).
 The transformation relation acquisition unit 25 estimates the transformation relation CSLAM-World between the SLAM coordinate system and the world coordinate system. Specifically, the transformation relation acquisition unit 25 first estimates the pose of the camera that captured the reference image data. That is, for each feature point in the SLAM map for which the search unit 24 found a corresponding reference feature point in the reference image data, the transformation relation acquisition unit 25 obtains the coordinate value of that feature point of the SLAM map in the SLAM coordinate system. The transformation relation acquisition unit 25 also obtains the positions, found by the search unit 24, of the reference feature points in the reference image data (two-dimensional positions in the reference image data) corresponding to each feature point, and thereby obtains information Cmap-geo that associates the SLAM-coordinate value of each feature point with the position of the corresponding reference feature point in the reference image data.
 The transformation relation acquisition unit 25 then obtains the pose information (six degrees of freedom) of the camera that captured the reference image data in the SLAM coordinate system by minimizing the reprojection error obtained when Cmap-geo is reprojected into the reference image data. This minimization is performed, for example, using the Levenberg-Marquardt method.
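 A minimal sketch of estimating the reference camera's pose in the SLAM coordinate system from the 3-D/2-D correspondences Cmap-geo is given below. As an assumption, OpenCV's solvePnPRansac (OpenCV 4.1 or later) is used as a stand-in for the reprojection-error minimization described above, and the intrinsic matrix K is taken as given.

```python
import numpy as np
import cv2

def estimate_reference_camera_pose(points_slam_3d, points_ref_2d, K):
    """points_slam_3d: (N, 3) SLAM coordinates; points_ref_2d: (N, 2) pixel positions."""
    obj = np.asarray(points_slam_3d, dtype=np.float64)
    img = np.asarray(points_ref_2d, dtype=np.float64)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(obj, img, K, None)
    if not ok or inliers is None:
        raise RuntimeError("pose estimation failed")
    # Levenberg-Marquardt refinement on the inlier set
    rvec, tvec = cv2.solvePnPRefineLM(obj[inliers.ravel()], img[inliers.ravel()],
                                      K, None, rvec, tvec)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec  # pose of the reference camera expressed in SLAM coordinates
```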
 The transformation relation acquisition unit 25 acquires sets of the pose of the camera that captured the reference image data in the SLAM coordinate system and the world coordinates associated with that reference image data, and obtains the transformation relation CSLAM-World between the SLAM coordinate system and the world coordinate system from these sets by a widely known method.
 The transformation processing unit 26 deforms the SLAM map using the transformation relation CSLAM-World between the SLAM coordinate system and the world coordinate system.
 In this deformation processing, the transformation processing unit 26 sequentially performs initialization processing, pose graph optimization, and bundle adjustment. The initialization processing uses the correspondence CSLAM-World between the SLAM coordinate system of the SLAM map and the world coordinate system to apply the following linear transformations, in order, to the entire SLAM map obtained up to that point, roughly aligning it with the world coordinate system. This initialization is performed only at the beginning of processing and at timings at which a predetermined condition is satisfied. Specifically, the initialization is performed at timings at which the search unit 24 has obtained a pair of a target frame and reference image data with at least the threshold number of corresponding feature points: at the first such timing (the beginning of processing), and at the i-th such timing (i being an integer with i >= 2) when the distance between the estimated position and the position information of the reference image data exceeds a predetermined threshold (for example, 10 m).
 The transformation processing unit 26 does not perform the pose graph optimization and bundle adjustment processing until the first initialization has been performed.
 In the initialization processing, as a first linear transformation, the transformation processing unit 26 assumes that the cameras that captured the moving image data lie on a single plane and rotates the SLAM map so that this plane coincides with the xz plane. The plane on which the cameras lie is estimated by principal component analysis of all the camera positions obtained.
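 A minimal sketch of this first linear transformation, assuming NumPy, is shown below: the plane normal is estimated by PCA of the camera positions and the map rotation that aligns it with the y axis is constructed.

```python
import numpy as np

def align_camera_plane_to_xz(camera_positions):
    """camera_positions: (N, 3) camera centres in SLAM coordinates. Returns a 3x3 rotation."""
    centred = camera_positions - camera_positions.mean(axis=0)
    # The right singular vector with the smallest singular value is the plane normal.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    normal = vt[-1]
    y_axis = np.array([0.0, 1.0, 0.0])
    v = np.cross(normal, y_axis)
    c = float(np.dot(normal, y_axis))
    if np.linalg.norm(v) < 1e-12:
        # Normal already (anti-)parallel to the y axis
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    # Rodrigues-style rotation taking the estimated normal onto the y axis
    return np.eye(3) + vx + vx @ vx * (1.0 / (1.0 + c))
```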
 As a second linear transformation, a similarity transformation is performed that brings the points p, in the SLAM coordinate system, of the first through i-th correspondences CSLAM-World close to the corresponding points pworld in the world coordinate system (equation (1)).
The four parameters [a, b, s, θ] of this transformation matrix (two translation components a and b, a uniform scale s, and a rotation angle θ) are estimated by solving the nonlinear least-squares problem of minimizing the cost function

    E = Σ_k || T[a, b, s, θ](p_k) − p_world,k ||²    (1)

using RANSAC (Random Sample Consensus) and the Levenberg-Marquardt method. Here p_world,k denotes the position information associated with the reference image data included in the k-th of the first through i-th pairs of key frames and reference image data.
 The poses of the cameras that captured the target frames and the reference image data, and the positions of the feature points of the SLAM map (three-dimensional positions in the SLAM coordinate system), are transformed by the estimated transformation matrix. The first and second linear transformations here are both kinds of three-dimensional similarity transformation, and these transformations do not improve scale drift.
 After initialization, each time a position estimate using the i-th reference image data is newly obtained, the transformation processing unit 26 performs scale drift improvement processing by pose graph optimization. Specifically, as illustrated in FIG. 3, the pose graph used here includes nodes Sn representing the pose information of the camera that captured the frames, nodes Sm representing the pose information of the cameras that captured the reference image data, and nodes fp representing the position information associated with the reference image data; adjacent nodes Sn (temporally adjacent in the moving image data), which represent the pose information of the camera that captured the frames, are connected to each other by a first constraint e1 on the pose. That is, the relative transformation between the camera poses of adjacent frames is regulated by this constraint e1.
 Further, among the nodes Sn, a node Sn corresponding to a target frame and a node Sm representing the pose information of the camera that captured reference image data containing feature points corresponding to feature points in that target frame are connected to each other by a second constraint e2. That is, the relative transformation between the camera poses of the target frame and the corresponding reference image data is regulated by this constraint e2.
 Furthermore, a node Sm representing the pose information of the camera that captured reference image data and the node fp representing the position information associated with that reference image data are connected to each other by a third constraint e3 relating to distance.
 As shown in FIG. 3(b), the transformation processing unit 26 of the present embodiment corrects the information on the movement path of the camera that captured the frames while allowing changes in scale and while suppressing changes in the camera pose information at each node and changes in the camera's direction of movement. At this time, the transformation processing unit 26 performs the correction so that the positions of the reference image data associated with the target frames, which are some of the frames, come close to the world-coordinate position information originally associated with that reference image data.
 A specific method for this processing will be described next. Here, an example is described in which scale drift is improved by constrained pose graph optimization in the three-dimensional similarity transformation group Sim(3). In pose graph optimization, the camera poses are taken as the optimization variables, and optimization is performed taking into account constraints on the relative transformations between camera poses. That is, in the present embodiment, the pose graph is optimized by performing a nonlinear deformation that takes scale drift into account.
 For this purpose, the representation of the three-dimensional similarity transformation group will first be described. In general, camera poses with six degrees of freedom and relative transformations between camera poses are expressed as elements of the special Euclidean group SE(3). On the other hand, in the optimization performed by the transformation processing unit 26 in the present embodiment, the camera poses and their relative transformations are treated as elements of Sim(3) as described above.
 A three-dimensional rigid transformation G belonging to the special Euclidean group SE(3) is defined by the following equation (2), where the rotation matrix R is an element of SO(3), the translation vector t (in fact a vector quantity printed in bold, written here as t for convenience) is a three-dimensional vector with real components, and s is a non-negative real value:

    G = [ R  t ; 0ᵀ  1 ]    (2)
 Here, the conversion from SE(3) to Sim(3) is performed by leaving the rotation matrix R and the translation vector t unchanged and setting the scale component s to 1. That is, the three-dimensional similarity transformation S (S belonging to Sim(3)) becomes

    S = [ sR  t ; 0ᵀ  1 ],  with s = 1.
 SO(3), SE(3), and Sim(3) all belong to Lie groups; each is related to its corresponding Lie algebra by the exponential map, and the inverse, the logarithmic map, is also defined. Here the Lie algebra is expressed in the vector notation of its coefficients. For example, the Lie algebra corresponding to Sim(3) is written as a seven-dimensional (seven degrees of freedom) vector quantity ξ, obtained by adding a scale component to the six-dimensional (six degrees of freedom) vector quantity representing the camera pose information, and its exponential map is defined as

    S = exp(ξ),  S ∈ Sim(3).

Note that a superscript T on a vector or matrix denotes transposition (the same applies hereinafter).
 The logarithmic map is the inverse of the exponential map, recovering the seven-dimensional coefficient vector as ξ = log(S) by a closed-form expression in which a matrix W appears.
 Here, W is a term similar to Rodrigues' formula. Using this representation of the three-dimensional similarity transformation group, a cost function associated with the constrained deformation of the pose graph is defined, and by minimizing this cost function with the Levenberg-Marquardt method on the Lie group, the processing that improves scale drift while maintaining the structure of the original SLAM map and the processing that brings the corresponding points of the two coordinate systems, the SLAM coordinate system and the world coordinate system, close to each other are performed at once.
 The use of the Levenberg-Marquardt method on Lie groups is already shown in H. Strasdat, J. Montiel and A. J. Davison: "Scale drift-aware large scale monocular SLAM", Robotics: Science and Systems VI (2010), so a detailed description is omitted here.
 Now, using the above three-dimensional similarity transformation, the first constraint e1 on the camera poses between adjacent nodes Si and Sj (temporally adjacent in the moving image data), which represent the pose information of the camera that captured the frames, is defined as the Sim(3) logarithm of the discrepancy between the current relative transformation and the fixed relative transformation ΔSi,j measured before optimization:

    e1i,j = log_Sim(3)( ΔSi,j · Sj · Si⁻¹ ).
 Here, ΔSi,j is the relative transformation between Si and Sj before optimization, converted into Sim(3); this value is held fixed during the optimization processing.
 As already stated, this logarithmic map is a seven-dimensional real vector.
 Similarly, the constraint e2k,l on the relative transformation between the camera pose of a target node and the camera pose of the corresponding reference image data is defined in the same form, as the Sim(3) logarithm of the discrepancy between the current relative transformation and the fixed relative transformation measured before optimization.
 Further, the third constraint e3m, relating to distance, between a node Sm representing the pose information of the camera that captured reference image data and the node fp representing the position information ym associated with that reference image data is defined, using the camera position implied by Sm, as the difference between that camera position and ym:

    e3m = pos(Sm) − ym.
 Of these, minimizing e1i,j and e2k,l works to suppress changes in the relative transformations between camera poses apart from gradual changes in scale. Minimizing e3m works to bring the camera positions of the reference image data close to the world-coordinate position information associated with the reference image data.
 To summarize the description so far, the pose graph used by the position estimation device 1 of the present embodiment is as follows.
- Node Sn: the pose of the camera when the n-th key frame was captured; Sn ∈ Sim(3), n ∈ {1, 2, ..., N}.
- Node Sm: the pose of the camera when the m-th reference image data was captured; Sm ∈ Sim(3), m ∈ {1, 2, ..., M}.
- Edge e1i,j: the constraint given by the relative transformation between the camera poses at the i-th and j-th key frames; (i, j) ∈ C1.
- Edge e2k,l: the constraint given by the relative transformation between the camera pose of the k-th target frame and the camera pose of the l-th reference image data; (k, l) ∈ C2.
- Edge e3m: the distance between the pose Sm of the camera that captured the reference image data and the position information ym, in the world coordinate system, associated with that reference image data; m ∈ {1, 2, ..., M}.
 Here, N is the total number of key frames, M is the number of target frames (the total number of reference image data associated with target frames), C1 is the set of pairs of key frames in which the same feature points in the SLAM map are captured, and C2 is the set of pairs of a target frame and the corresponding reference image data.
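 A minimal structural sketch of this pose graph is given below. Poses are stored as 4x4 similarity matrices and scipy.linalg.logm stands in, as an assumption, for the Sim(3) logarithmic map; the residuals are written schematically to show the graph layout, not to reproduce the patented optimizer.

```python
import numpy as np
from scipy.linalg import logm, inv

class PoseGraph:
    def __init__(self):
        self.keyframe_poses = {}       # n -> Sn (4x4 similarity matrix)
        self.reference_poses = {}      # m -> Sm (4x4 similarity matrix)
        self.reference_positions = {}  # m -> ym (3-vector, world coordinates)
        self.e1_edges = []             # (i, j, delta_S_ij) for (i, j) in C1
        self.e2_edges = []             # (k, l, delta_S_kl) for (k, l) in C2

    def residuals(self):
        r = []
        for i, j, delta in self.e1_edges:     # relative-pose constraints between key frames
            Si, Sj = self.keyframe_poses[i], self.keyframe_poses[j]
            r.append(np.linalg.norm(logm(delta @ Sj @ inv(Si))))
        for k, l, delta in self.e2_edges:     # key frame <-> reference image constraints
            Sk, Sl = self.keyframe_poses[k], self.reference_poses[l]
            r.append(np.linalg.norm(logm(delta @ Sl @ inv(Sk))))
        for m, Sm in self.reference_poses.items():  # geotag position constraints
            cam_centre = inv(Sm)[:3, 3]             # camera centre implied by Sm
            r.append(np.linalg.norm(cam_centre - self.reference_positions[m]))
        return np.array(r)
```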
 After the initialization processing described above, the transformation processing unit 26 executes processing that optimizes the pose graph defined as above. That is, the transformation processing unit 26 extracts, from the frames of the accepted moving image data, at least one set of key frames C1 (assumed to contain N key frames) in which common feature points are captured. Referring also to the pairs C2 (assumed to be M pairs), found by the search unit 24, of target frames and reference image data containing the corresponding feature points, it estimates the camera pose information S1, S2, ... by minimizing, with the Levenberg-Marquardt method, a cost function on the Lie manifold of the form

    E_pose = Σ_(i,j)∈C1 ||e1i,j||² + Σ_(k,l)∈C2 ||e2k,l||² + Σ_m ||e3m||².
 Furthermore, the transformation processing unit 26 reflects the transformation obtained by this optimization in the positions of the feature points in the SLAM map. This reflection can be performed by adopting a widely known method, for example the method used in H. Strasdat, et al., "Scale drift-aware large scale monocular SLAM," Robotics: Science and Systems VI, 2010.
 As described above, in the present embodiment, the pose graph used by Strasdat et al. is extended with nodes Sm representing the poses of the cameras that captured the m-th reference image data, edges e2k,l representing the constraints given by the relative transformations between the camera poses of the target frames and the reference image data, and edges e3m representing the distances between the pose Sm of the camera that captured the reference image data and the position information ym, in the world coordinate system, associated with that reference image data; by correcting the camera positions of the reference image data, such as geotagged images, toward the world-coordinate values represented by the position information (geotags), rather than relying on loop closure, the scale drift of the SLAM map is improved.
 If the frames of the accepted moving image data contain another group of key frames C1, different from the key frame group C1 already subjected to the above pose graph optimization, in which common feature points are captured, the transformation processing unit 26 sequentially performs the above pose graph optimization processing on that key frame group C1 as well.
 The transformation processing unit 26 further deforms the SLAM map by bundle adjustment (BA) including constraints determined in relation to the reference image data.
 That is, after estimating the camera poses by pose graph optimization and correcting the SLAM map, the transformation processing unit 26 performs the following bundle adjustment. The transformation processing unit 26 of the present embodiment jointly minimizes the reprojection error of Cfp-kf and the reprojection error of Cfp-geo.
 Specifically, the transformation processing unit 26 that performs this bundle adjustment computes the reprojection error ri,j between the i-th feature point and the j-th camera pose as

    ri,j = xi − π( Rj Xi + tj ).
 Here,

    π(p) = [ fx·px/pz + cx ,  fy·py/pz + cy ]ᵀ,

where Xi is the coordinate of the feature point in the SLAM coordinate system, xi is the two-dimensional coordinate of the feature point within the frame, and Rj and tj represent the rotation and translation of the j-th camera pose. The function π projects three-dimensional coordinates (p = [px, py, pz]ᵀ) onto the two-dimensional coordinate system; (fx, fy) is the focal length and (cx, cy) represents the coordinates of the center of projection.
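 A minimal sketch of the projection function π and the reprojection error ri,j described above, assuming the intrinsics (fx, fy, cx, cy) are known, is shown below.

```python
import numpy as np

def project(p, fx, fy, cx, cy):
    """Project a 3-D point p = [px, py, pz] onto the image plane (pinhole model)."""
    px, py, pz = p
    return np.array([fx * px / pz + cx, fy * py / pz + cy])

def reprojection_error(x_i, X_i, R_j, t_j, fx, fy, cx, cy):
    """x_i: observed 2-D position, X_i: 3-D point in SLAM coordinates,
    (R_j, t_j): pose of the j-th camera."""
    return x_i - project(R_j @ X_i + t_j, fx, fy, cx, cy)
```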
 Further, in order to reflect the position information of the reference image data in the bundle adjustment, the total cost function is defined, including the constraints on the reference image data, as the sum of the robustified squared reprojection errors over both the key-frame correspondences and the reference-image correspondences:

    C_total = Σ_j∈C1 Σ_i∈C5 ρ(||ri,j||²) + Σ_(i,j)∈C3 ρ(||ri,j||²).
 Here, Tj is the information on the pose of the camera that captured the j-th key frame, expressed as an element of SE(3), ρ is the Huber robust cost function, and C5 represents the feature points in the key frames of C1. C1 and C3 are the sets already described.
 The feature point positions and the camera pose information for the key frames are obtained by minimizing this total cost function on the Lie manifold. This minimization can be performed using the Levenberg-Marquardt method.
 As described above, in the present embodiment, in order to incorporate the constraints given by the position information of the reference image data, not only the usual reprojection error of Cfp-kf but also the reprojection error of Cfp-geo is minimized. During the bundle adjustment, the pose information of the cameras that captured the reference image data is held fixed. This deformation can further reduce the scale drift of the SLAM map, particularly when a sufficiently good initial solution is given.
[Operation]
 The position estimation device 1 of the present embodiment has the configuration described above and operates as follows. In one example of the present embodiment, the position estimation device 1 accepts moving image data including a series of frames obtained by imaging subjects along a movement path while moving, and executes the processing illustrated in FIG. 4.
 The position estimation device 1 of the present embodiment extracts feature points of the subjects captured in each frame of the moving image data and generates a SLAM map, a reconstruction map that associates the feature points with coordinates in a reconstruction space, which is a predetermined virtual three-dimensional space (S1: SLAM processing).
 Next, the position estimation device 1 selects at least one of the frames included in the accepted moving image data as a target frame (S2) and repeats the following processing for each target frame.
 That is, the position estimation device 1 sets the selected target frames, one at a time in a predetermined order, as the target frame of interest. The position estimation device 1 then acquires reference image data in which at least one subject captured in the target frame of interest appears and which is associated with position information, in a predetermined world coordinate system, representing the position at which it was captured in advance (S3).
 As already described, this acquisition processing may be performed by instructing a server on the network to perform a search, or by instructing the user to input reference image data.
 The position estimation device 1 searches the acquired reference image data for reference feature points corresponding to the feature points found in the target frame of interest (feature point matching: S4). The position estimation device 1 then checks whether the number of reference feature points found here exceeds a predetermined threshold (for example, 5) (S5). If, in processing S5, the number of found reference feature points is at least the predetermined threshold (S5: Yes), the position estimation device 1 estimates the position and pose of the camera that captured the reference image data acquired in processing S3 (S6). This processing can be performed using the coordinates of the feature points in the target frame of interest (two-dimensional coordinates), the coordinates of the feature points in the coordinate system of the SLAM map (three-dimensional coordinates), the coordinates of the corresponding reference feature points in the reference image data (two-dimensional coordinates), and the position information (three-dimensional coordinate values in the world coordinate system) associated with that reference image data.
 The position estimation device 1 then returns to processing S3, sets the next target frame as the target frame of interest, and continues processing. If there is no target frame that has not yet been set as the target frame of interest (that is, if the processing for all the target frames has been completed), the device proceeds to the next processing (processing S10 described below).
 If the number of reference feature points found in processing S5 is less than the predetermined threshold (S5: No), the device returns to processing S3 without executing processing S6, sets the next target frame as the target frame of interest, and continues processing. Here as well, if there is no target frame that has not yet been set as the target frame of interest (if the processing for all the target frames has been completed), the device proceeds to the next processing (processing S10 described below).
 Through the processing from S3 to S6, the position estimation device 1 can acquire the transformation relation between the world coordinate system, which is the coordinate system of the position information associated with each reference image data, and the coordinate system of the SLAM map, which is the coordinate system of the coordinate values of the feature points found in each target frame.
 Next, the position estimation device 1 performs the deformation processing of the SLAM map. In this processing, the position estimation device 1 first determines whether to perform the initialization processing (S10). Here, for example, it may be determined that the initialization processing is to be performed when the initialization processing has not been performed in the past. When performing the sequential processing described later, it may instead, or in addition, be determined that the initialization processing is to be performed when the distance between the position previously estimated as the camera position (this estimation will be described later) and the position information associated with the corresponding reference image data exceeds a predetermined threshold (for example, 10 m).
 When the position estimation device 1 determines in processing S10 that the initialization processing is to be performed, it executes the initialization processing (S11), first assuming that the cameras that captured the moving image data lie on a single plane and performing the following processing. That is, based on the positions, in the SLAM map obtained so far, of the camera that captured the accepted moving image data, the position estimation device 1 determines, by principal component analysis of those position data, the coordinate axis of the plane on which the camera is estimated to lie (for example, the axis of the normal to that plane), and rotates the SLAM map so that this normal becomes parallel to the y axis (so that the plane on which the camera lies becomes the xz plane).
 Also, in this initialization processing, using the frames and reference image data obtained so far and the position information associated with the reference image data (position information in the world coordinate system), the position estimation device 1 scales (applies a similarity transformation to) the SLAM coordinate system so that the sum of the absolute values (or the sum of the squares) of the differences between the coordinate values of the feature points in the SLAM coordinate system and the world-coordinate values of the corresponding reference feature points is minimized.
 When the position estimation device 1 determines that the initialization processing is not to be performed, or after the initialization processing, it further optimizes a constrained pose graph in Sim(3), as illustrated in FIG. 3, including:
- node Sn: the pose of the camera when the n-th key frame was captured; Sn ∈ Sim(3), n ∈ {1, 2, ..., N},
- node Sm: the pose of the camera when the m-th reference image data was captured; Sm ∈ Sim(3), m ∈ {1, 2, ..., M},
- edge e1i,j: the constraint given by the relative transformation between the camera poses at the i-th and j-th key frames; (i, j) ∈ C1,
- edge e2k,l: the constraint given by the relative transformation between the camera pose of the k-th target frame and the camera pose of the l-th reference image data; (k, l) ∈ C2,
- edge e3m: the distance between the pose Sm of the camera that captured the reference image data and the position information ym, in the world coordinate system, associated with that reference image data; m ∈ {1, 2, ..., M}
(S12).
 Through this processing S12, optimization processing is performed using a cost function that includes a cost value relating to changes in the estimated position and orientation, expressed in SLAM coordinates, of the camera that captured the moving image data within the reconstruction space (SLAM map), a cost value relating to the scaling of the camera's movement amount, and a cost value relating to the distance between the camera position and the position information associated with the reference image data; a correction that scales the movement amount of the camera while suppressing changes in the estimated position and orientation of the camera expressed in SLAM coordinates and a correction that brings the position and orientation of the camera close to the position information associated with the corresponding reference image data are thereby performed at once.
 The position estimation device 1 also performs bundle adjustment processing that corrects the map so that the reprojection errors of the feature points into the key frames and the reprojection errors of the reference feature points into the reference image data become small (S13). As a result, estimates of the position and orientation, expressed in the world coordinate system, of the camera that captured the moving image data are obtained for each target frame.
 The position estimation device 1 outputs the information on the position and orientation of the camera in the world coordinate system obtained through the processing so far (the estimation results) (S14), and ends the processing.
[Sequential input]
 In the description so far, the case where the position estimation device 1 accepts moving image data whose capture has already been completed has been described as an example, but the present embodiment is not limited to this.
 In one example of the present embodiment, moving image data including a series of frames obtained by imaging subjects on a movement path while moving may be received sequentially, each time a part of the frames (for example, one frame) has been captured. When processing moving image data while it is being captured in this way, the SLAM map is generated by sequential processing (that is, feature points of the subjects captured in the sequentially received frames are extracted, and the feature points are associated with coordinates in the reconstruction space, which is a predetermined virtual three-dimensional space). Such SLAM map generation is widely known as incremental SfM (Structure from Motion) and the like, and a detailed description is therefore omitted here.
 In this example, the position estimation device 1 performs the subsequent processing sequentially. That is, the position estimation device 1 determines whether or not a received frame is to be used as a keyframe and, if so, selects that keyframe as a target frame and performs the processing from step S3 onward in FIG. 4. In this case, if the difference between the camera position estimated for a certain frame by steps S12 and S13 in FIG. 4 and the position represented by the position information associated with the reference image data corresponding to that frame exceeds a predetermined threshold, the processing may return to step S10 and the initialization may be redone.
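 A rough sketch of this sequential loop is given below. The SLAM front end, the keyframe test, the per-keyframe correction (S3 to S13), and the reinitialization threshold are all placeholders; the interfaces and the 10 m threshold value are assumptions made only for illustration.

```python
import numpy as np

REINIT_THRESHOLD_M = 10.0   # assumed value; the text only says "a predetermined threshold"

def process_incoming_frames(frames, slam, geo_localizer):
    """Sequential variant: frames arrive one at a time; `slam` and `geo_localizer`
    stand in for the incremental SfM/SLAM front end and for the S3-S13 correction
    steps described above (illustrative interfaces)."""
    for frame in frames:
        slam.track(frame)                               # incremental SLAM map update
        if not slam.is_keyframe(frame):
            continue
        result = geo_localizer.correct(frame)           # runs S3 .. S13 for this keyframe
        if result.matched_reference is not None:
            gap = np.linalg.norm(result.world_position -
                                 result.matched_reference.world_position)
            if gap > REINIT_THRESHOLD_M:
                geo_localizer.reinitialize()            # redo the initialization (S10)
```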
 Next, position estimation methods that also use reference image data associated with position information (geotagged images in the following examples) but differ from the example of the present invention are compared with the position estimation method according to the example of the present invention, and the accuracy of the method of the example, in which the position estimation device 1 of the embodiment described above is implemented on a computer, is evaluated quantitatively.
 In addition, for the three-stage deformation used in the following examples, consisting of (1) the initialization process (linear transformation: ILT), (2) the constrained pose graph optimization on Sim(3) (PGO), and (3) the bundle adjustment process (BA), which corrects the estimates so that the reprojection error of the feature points onto the frames and the reprojection error of the reference feature points onto the reference image data become small, the effect of each stage on accuracy and its usefulness are examined.
[Data set and implementation details]
 The following examples use The Málaga Stereo and Laser Urban Data Set (Málaga data set), a driving-video data set captured over a long distance in an urban area of Spain. The video in the Málaga data set has a resolution of 1024 × 768 and a frame rate of 20 fps. In these examples, two videos (Video 1 and Video 2) are cut out of the data set and used for evaluation. Neither of the routes captured in the two videos forms a loop, and each route is 1 km or longer. GPS position information acquired once per second is associated with all frames, but some frames contain errors of 10 m or more.
[Evaluation metrics]
 For quantitative comparison, the average (Ave) and standard deviation (SD) of the distances (in meters) between the positions of a plurality of points on the ground-truth trajectory (Ground Truth: GT) described below and the estimated camera positions for the keyframes corresponding to those GT points are used as evaluation metrics. As noted above, the GPS position information attached to the data set contains errors, so the positions of the GT points were set manually for several keyframes from the three-dimensional map of Google Street View.
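 For illustration, with the GT and estimated camera positions held as NumPy arrays, these two metrics reduce to the following (names are illustrative):

```python
import numpy as np

def localization_error_stats(gt_positions, est_positions):
    """Average and standard deviation, in meters, of per-keyframe position error."""
    dists = np.linalg.norm(gt_positions - est_positions, axis=1)
    return dists.mean(), dists.std()

# ave_m, sd_m = localization_error_stats(gt_xyz, estimated_xyz)
```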
[Comparison with other position estimation methods]
 As other position estimation methods for comparison, the method described in T. Kroeger et al., "Video registration to SfM models," ECCV, pp. 1-16, 2014 (hereinafter the Kroeger method) and the method described in S. P. Chang et al., "Extracting driving behavior: Global metric localization from dashcam videos in the wild," ECCV, pp. 136-148, 2016 (hereinafter the Chang method) were compared with the present example.
 Unlike the present example, the Kroeger and Chang methods use RGB-D images (images with distance information) and 3D point cloud information. Therefore, the part that uses SLAM and obtains the correspondence between the world coordinate system and the SLAM coordinate system (the reconstructed three-dimensional coordinate system) was replaced with the same processing as in the present example, and only the processing that deforms the SLAM map was compared. The Kroeger method smooths the camera poses using spline smoothing; here, as a concrete instance of that method, the camera poses of the geotagged images (reference image data) matched to keyframes were smoothed with a cubic B-spline (Interpolation).
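 For illustration, this kind of trajectory interpolation can be sketched with SciPy's CubicSpline used as a stand-in for the cubic B-spline fit (the actual Kroeger implementation differs; names are illustrative):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def interpolate_trajectory(anchor_times, anchor_positions, query_times):
    """Fit a cubic spline through the sparse matched camera positions and
    evaluate it at every keyframe timestamp."""
    spline = CubicSpline(anchor_times, anchor_positions, axis=0)
    return spline(query_times)   # (len(query_times), 3) interpolated positions

# positions = interpolate_trajectory(matched_times, matched_xyz, keyframe_times)
```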
 The Chang method applies a two-dimensional affine transformation in the xz plane and a scale transformation in the y-axis direction so that the SLAM map reconstructed from the input moving image data matches the 3D point cloud obtained from Google Street View. Here, specifically, a method that applies the same transformation (Affine+) as the Chang method to the SLAM map was used.
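 For illustration, applying such a transformation to SLAM map points can be sketched as follows (estimating the 2D affine parameters and the y-scale from correspondences is omitted; names are illustrative):

```python
import numpy as np

def apply_affine_plus(points, A, b, sy):
    """points: (N, 3) SLAM map points ordered (x, y, z).
    A: 2x2 affine matrix and b: 2-vector acting on the (x, z) columns;
    sy: scale factor applied to the y (height) column."""
    out = points.copy()
    out[:, [0, 2]] = points[:, [0, 2]] @ A.T + b
    out[:, 1] = sy * points[:, 1]
    return out
```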
 The results of these processes are shown in FIG. 5. As shown in FIG. 5, the trajectory obtained with the method of the present example stays much closer to the GT than the other methods, achieving a substantial improvement in accuracy. Specifically, the Kroeger method interpolates the sparse correspondence between geotags and video without considering the three-dimensional structure, so large position estimation errors occur where sufficient corresponding points do not exist. The Chang method does use the three-dimensional structure of the video, but because it relies on a simple linear transformation, it is strongly affected by distortion caused by scale drift and may fail to estimate positions with sufficient accuracy.
[Effect of the three-stage deformation]
 The effect of each stage of the three-stage deformation performed in the present example on accuracy was also examined. Specifically, position estimation based on Video 1 was performed using various combinations of at least some of the three deformation processes. The results are shown in FIG. 6, in which "*" indicates that no meaningful value was obtained. As can be seen from FIG. 6, the most accurate position estimation is achieved when all deformation methods are applied (#5). When the results of #1 to #5 in FIG. 6 are plotted on an actual map, the cases without pose graph optimization (PGO) (#2, #3) show places with errors of about 100 meters even when bundle adjustment (BA) is applied. On the other hand, when pose graph optimization (PGO) is applied (#4, #5), the scale drift is eliminated, and as a result it can be confirmed that the final bundle adjustment (BA) works properly.
 From this, it was found that pose graph optimization must be applied at least for bundle adjustment to work properly.
[Conclusion]
 In this example, using the result of three-dimensional reconstruction (SLAM), the absolute position (the position in the world coordinate system, that is, information related to latitude and longitude) of the camera at the time each frame of the moving image data was captured was estimated from the sparse correspondence between geotagged images and the captured moving image data. When, as with the moving image data used in this example, the video is captured along a route longer than 1 km and the trajectory of the route does not form a loop (never returns to the same point), the result of three-dimensional reconstruction can contain errors of several tens of meters at real scale.
 This is because, in ordinary three-dimensional reconstruction from moving image data such as SLAM, the accumulated error can be reduced only when the movement trajectory at the time of capture forms a loop (the same place is observed again).
 In the present embodiment, processing that corrects scale drift is integrated by using the sparse correspondence between geotagged images and the captured moving image data (association of only some frames), which makes it possible to use the structure obtained by three-dimensional reconstruction appropriately for absolute position estimation.
 The smaller the error contained in the position information, such as latitude and longitude, associated with the reference image data, the higher the accuracy is expected to be. Also, if the feature points in the frames of the captured moving image data and the corresponding reference feature points in the reference image data lie far from the positions at which the respective images were captured, or if they are few in number, the accuracy of matching between feature points becomes insufficient. Furthermore, when the camera faces a direction parallel to the road (the movement trajectory), or when the camera's angle of view is narrow, the error in the direction from the camera to the feature point group becomes relatively large when estimating the position and orientation of the camera that captured the reference image data; this error arises mainly in the direction parallel to the road.
 It is therefore preferable to further improve the accuracy of the feature point matching method and/or to use a device such as an omnidirectional camera so that images in directions orthogonal to the movement trajectory are also included in the processing.
Reference Signs List: 1 position estimation device, 11 control unit, 12 storage unit, 13 operation unit, 14 display unit, 15 communication unit, 16 interface unit, 21 reception unit, 22 SLAM processing unit, 23 reference image data acquisition unit, 24 search unit, 25 conversion relation acquisition unit, 26 conversion processing unit.

Claims (6)

  1.  A position estimation device comprising:
     receiving means for receiving moving image data including a series of frames obtained by imaging subjects on a movement path while moving;
     acquisition means for taking at least one of the frames included in the received moving image data as a target frame and acquiring reference image data in which at least one subject imaged in the target frame is imaged, the reference image data being associated with position information, in a predetermined world coordinate system, representing the position at which it was captured in advance;
     reconstruction means for extracting feature points of the subjects imaged in the frames included in the moving image data and generating a reconstruction map in which the feature points are associated with coordinates in a reconstruction space, which is a predetermined virtual three-dimensional space;
     search means for searching the reference image data for reference feature points corresponding to the feature points;
     relation acquisition means for acquiring, based on the retrieved reference feature points, the position information associated with the reference image data, and the coordinates associated with the feature points, a transformation relation between the world coordinate system, which is the coordinate system of the position information, and the coordinate system of the coordinates associated with the feature points; and
     conversion means for correcting the reconstruction map using the transformation relation,
     wherein the conversion means corrects the reconstruction map by performing a correction that scales the amount of movement of the camera that captured the moving image data while suppressing changes in the estimated position and orientation of the camera represented by coordinates in the reconstruction space, and a correction that brings the position and orientation of the camera close to the position information associated with the reference image data.
  2.  The position estimation device according to claim 1, wherein the conversion means performs, at one time, the correction that scales the amount of movement of the camera that captured the moving image data while suppressing changes in the estimated position and orientation of the camera represented by coordinates in the reconstruction space and the correction that brings the position and orientation of the camera close to the position information associated with the reference image data, using a cost function including a cost value relating to changes in the estimated position and orientation of the camera represented by coordinates in the reconstruction space, a cost value relating to the scaling of the amount of movement of the camera, and a cost value relating to the distance between the position of the camera and the position information associated with the reference image data.
  3.  The position estimation device according to claim 1 or 2, wherein, after performing the correction that scales the amount of movement of the camera that captured the moving image data while suppressing changes in the estimated position and orientation of the camera represented by coordinates in the reconstruction space and the correction that brings the position and orientation of the camera close to the position information associated with the reference image data, the conversion means further performs bundle adjustment processing that corrects the estimates so that the reprojection error of the feature points onto the frames and the reprojection error of the reference feature points onto the reference image data become small.
  4.  The position estimation device according to any one of claims 1 to 3, wherein the receiving means sequentially receives the moving image data including a series of frames obtained by imaging subjects on a movement path while moving, each time a part of the frames has been captured, and the reconstruction means extracts feature points of the subjects imaged in the sequentially received frames and associates the feature points with coordinates in the reconstruction space, which is a predetermined virtual three-dimensional space.
  5.  A position estimation method using a computer, the computer:
     receiving moving image data including a series of frames obtained by imaging subjects on a movement path while moving;
     taking at least one of the frames included in the received moving image data as a target frame and acquiring reference image data in which at least one subject imaged in the target frame is imaged, the reference image data being associated with position information, in a predetermined world coordinate system, representing the position at which it was captured in advance;
     extracting feature points of the subjects imaged in the frames included in the moving image data and generating a reconstruction map in which the feature points are associated with coordinates in a reconstruction space, which is a predetermined virtual three-dimensional space;
     searching the reference image data for reference feature points corresponding to the feature points;
     acquiring, based on the retrieved reference feature points, the position information associated with the reference image data, and the coordinates associated with the feature points, a transformation relation between the world coordinate system, which is the coordinate system of the position information, and the coordinate system of the coordinates associated with the feature points; and
     correcting the reconstruction map by performing a correction that scales the amount of movement of the camera that captured the moving image data while suppressing changes in the estimated position and orientation of the camera represented by coordinates in the reconstruction space, and a correction that brings the position and orientation of the camera close to the position information associated with the reference image data.
  6.  A program causing a computer to function as:
     receiving means for receiving moving image data including a series of frames obtained by imaging subjects on a movement path while moving;
     acquisition means for taking at least one of the frames included in the received moving image data as a target frame and acquiring reference image data in which at least one subject imaged in the target frame is imaged, the reference image data being associated with position information, in a predetermined world coordinate system, representing the position at which it was captured in advance;
     reconstruction means for extracting feature points of the subjects imaged in the frames included in the moving image data and generating a reconstruction map in which the feature points are associated with coordinates in a reconstruction space, which is a predetermined virtual three-dimensional space;
     search means for searching the reference image data for reference feature points corresponding to the feature points;
     relation acquisition means for acquiring, based on the retrieved reference feature points, the position information associated with the reference image data, and the coordinates associated with the feature points, a transformation relation between the world coordinate system, which is the coordinate system of the position information, and the coordinate system of the coordinates associated with the feature points; and
     conversion means for correcting the reconstruction map by performing a correction that scales the amount of movement of the camera that captured the moving image data while suppressing changes in the estimated position and orientation of the camera represented by coordinates in the reconstruction space, and a correction that brings the position and orientation of the camera close to the position information associated with the reference image data.

PCT/JP2018/023697 2017-06-21 2018-06-21 Position estimating device, position estimating method, and program WO2018235923A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201762522857P 2017-06-21 2017-06-21
US62/522,857 2017-06-21

Publications (1)

Publication Number Publication Date
WO2018235923A1 true WO2018235923A1 (en) 2018-12-27

Family

ID=64737652

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/023697 WO2018235923A1 (en) 2017-06-21 2018-06-21 Position estimating device, position estimating method, and program

Country Status (1)

Country Link
WO (1) WO2018235923A1 (en)


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013535013A (en) * 2010-06-25 2013-09-09 トリンブル ナビゲーション リミテッド Method and apparatus for image-based positioning

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013535013A (en) * 2010-06-25 2013-09-09 トリンブル ナビゲーション リミテッド Method and apparatus for image-based positioning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHANG, SHAOPIN ET AL.: "Extracting Driving Behavior: Global Metric Localization from Dashcam Video in the Wild", COMPUTER VISION - ECCV 2016 WORKSHOPS, 2016, pages 136 - 148, XP055565003 *
IWAMI, KAZUYA ET AL.: "Global Metric Localization and Correction of Scale Drift using Street View", IEICE TECHNICAL REPORT, vol. 117, no. 106, 15 June 2017 (2017-06-15), pages 69 - 74 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110106755A (en) * 2019-04-04 2019-08-09 武汉大学 Utilize the uneven pliable detection method of the high-speed railway rail of attitude reconstruction rail geometric shape
CN110106755B (en) * 2019-04-04 2020-11-03 武汉大学 Method for detecting irregularity of high-speed rail by reconstructing rail geometric form through attitude
CN110570473A (en) * 2019-09-12 2019-12-13 河北工业大学 weight self-adaptive posture estimation method based on point-line fusion
JP2021082181A (en) * 2019-11-22 2021-05-27 パナソニックIpマネジメント株式会社 Position estimation device, vehicle, position estimation method and position estimation program
WO2021100650A1 (en) * 2019-11-22 2021-05-27 パナソニックIpマネジメント株式会社 Position estimation device, vehicle, position estimation method and position estimation program
CN111882494A (en) * 2020-06-28 2020-11-03 广州文远知行科技有限公司 Pose graph processing method and device, computer equipment and storage medium
CN111882494B (en) * 2020-06-28 2024-05-14 广州文远知行科技有限公司 Pose graph processing method and device, computer equipment and storage medium
WO2022118061A1 (en) * 2020-12-04 2022-06-09 Hinge Health, Inc. Object three-dimensional localizations in images or videos
CN112819970A (en) * 2021-02-19 2021-05-18 联想(北京)有限公司 Control method and device and electronic equipment
CN112819970B (en) * 2021-02-19 2023-12-26 联想(北京)有限公司 Control method and device and electronic equipment
JP2023529786A (en) * 2021-05-07 2023-07-12 テンセント・アメリカ・エルエルシー A method for estimating inter-camera pose graphs and transformation matrices by recognizing markers on the ground in panoramic images
CN115700507A (en) * 2021-07-30 2023-02-07 北京小米移动软件有限公司 Map updating method and device
CN115700507B (en) * 2021-07-30 2024-02-13 北京小米移动软件有限公司 Map updating method and device
CN113781550A (en) * 2021-08-10 2021-12-10 国网河北省电力有限公司保定供电分公司 Four-foot robot positioning method and system
CN113779012B (en) * 2021-09-16 2023-03-07 中国电子科技集团公司第五十四研究所 Monocular vision SLAM scale recovery method for unmanned aerial vehicle
CN113779012A (en) * 2021-09-16 2021-12-10 中国电子科技集团公司第五十四研究所 Monocular vision SLAM scale recovery method for unmanned aerial vehicle

Similar Documents

Publication Publication Date Title
WO2018235923A1 (en) Position estimating device, position estimating method, and program
US10269147B2 (en) Real-time camera position estimation with drift mitigation in incremental structure from motion
Li et al. DeepI2P: Image-to-point cloud registration via deep classification
US11176701B2 (en) Position estimation system and position estimation method
US10269148B2 (en) Real-time image undistortion for incremental 3D reconstruction
US6587601B1 (en) Method and apparatus for performing geo-spatial registration using a Euclidean representation
US9396583B2 (en) Method of modelling buildings on the basis of a georeferenced image
US20170236284A1 (en) Registration of aerial imagery to vector road maps with on-road vehicular detection and tracking
CN111862126A (en) Non-cooperative target relative pose estimation method combining deep learning and geometric algorithm
WO2023045455A1 (en) Non-cooperative target three-dimensional reconstruction method based on branch reconstruction registration
US20170092015A1 (en) Generating Scene Reconstructions from Images
CN112750203B (en) Model reconstruction method, device, equipment and storage medium
Tao et al. Automated localisation of Mars rovers using co-registered HiRISE-CTX-HRSC orthorectified images and wide baseline Navcam orthorectified mosaics
CN114565863B (en) Real-time generation method, device, medium and equipment for orthophoto of unmanned aerial vehicle image
CN112197764A (en) Real-time pose determining method and device and electronic equipment
CN112132754B (en) Vehicle movement track correction method and related device
Patil et al. A new stereo benchmarking dataset for satellite images
Zhao et al. RTSfM: Real-time structure from motion for mosaicing and DSM mapping of sequential aerial images with low overlap
CN111553845A (en) Rapid image splicing method based on optimized three-dimensional reconstruction
CN110570474A (en) Pose estimation method and system of depth camera
US20240161392A1 (en) Point cloud model processing method and apparatus, and readable storage medium
Suliman et al. Development of line-of-sight digital surface model for co-registering off-nadir VHR satellite imagery with elevation data
Zhao et al. Fast georeferenced aerial image stitching with absolute rotation averaging and planar-restricted pose graph
WO2023017663A1 (en) Systems and methods for image processing based on optimal transport and epipolar geometry
Fu-Sheng et al. Batch reconstruction from UAV images with prior information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18819744

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18819744

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP