WO2018235923A1 - Position estimation device, position estimation method, and program - Google Patents

Position estimation device, position estimation method, and program

Info

Publication number
WO2018235923A1
WO2018235923A1 (PCT/JP2018/023697, JP2018023697W)
Authority
WO
WIPO (PCT)
Prior art keywords
image data
camera
reference image
coordinate system
coordinates
Prior art date
Application number
PCT/JP2018/023697
Other languages
English (en)
Japanese (ja)
Inventor
清晴 相澤
和也 石見
Original Assignee
国立大学法人 東京大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 国立大学法人 東京大学 filed Critical 国立大学法人 東京大学
Publication of WO2018235923A1 publication Critical patent/WO2018235923A1/fr

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras

Definitions

  • the present invention relates to a position estimation device, a position estimation method, and a program.
  • GIS geographical information system
  • Compared with methods that use a road network or satellite imagery, the method of estimating the absolute position of an image using geotagged images enables position estimation with relatively high accuracy, provided that sufficient correspondences between the images can be obtained.
  • However, associating the input driving video with geotagged images may be difficult because of changes in the illumination environment or differences in the angle of view. If this association is not made appropriately, position estimation cannot be performed, or cannot be performed with the required accuracy.
  • The present invention has been made in view of such circumstances, and it is an object of the present invention to provide a position estimation device, a position estimation method, and a program capable of setting corresponding position information for each frame image of a driving video, including frames of places that cannot be directly associated with a geotagged image.
  • Non-Patent Document 1 discloses that three-dimensional information of objects captured in a driving video is estimated (reconstructed) by SLAM (Simultaneous Localization and Mapping), and that the position of the driving video is estimated by deforming the three-dimensional reconstruction map obtained by this estimation into the world coordinate system. In the method of Non-Patent Document 1, however, it is known that errors accumulate as the traveling distance increases and that the accuracy of position estimation decreases.
  • SLAM Simultaneous Localization and Mapping
  • The present invention for solving the problems of the prior art is a position estimation apparatus comprising: accepting means for receiving moving image data including a series of frames obtained by imaging subjects along a movement path while moving; acquisition means for acquiring, with at least one of the frames included in the moving image data as a target frame, reference image data in which at least one subject imaged in the target frame is also imaged, the reference image data being associated with position information in a predetermined world coordinate system representing the position at which it was captured in advance; reconstruction means for extracting feature points of the subjects imaged in each frame included in the moving image data and generating a reconstruction map in which the feature points are associated with coordinates in a reconstruction space, which is a virtual three-dimensional space; and search means for searching the reference image data for reference feature points corresponding to the feature points;
  • relationship acquiring means for acquiring, on the basis of the searched reference feature points, the position information associated with the reference image data, and the coordinates associated with the feature points, a conversion relationship between the world coordinate system, which is the coordinate system of the position information, and the coordinate system of the coordinates associated with the feature points;
  • and correction means for correcting the reconstruction map by bringing the estimated position and orientation of the camera close to the position information associated with the reference image data.
  • According to the present invention, corresponding position information can be set for each frame image of a driving video, including frames of places that cannot be directly associated with a geotagged image.
  • As illustrated in FIG. 1, the position estimation device 1 includes a control unit 11, a storage unit 12, an operation unit 13, a display unit 14, a communication unit 15, and an interface unit 16.
  • the control unit 11 is a program control device such as a CPU, and operates in accordance with a program stored in the storage unit 12.
  • the control unit 11 receives moving image data including a series of frames obtained by capturing an object on a moving path while moving.
  • With at least one of the frames included in the received moving image data as a target frame, the control unit 11 acquires reference image data in which at least one subject imaged in the target frame is also imaged, the reference image data being associated with position information in world coordinates representing the position at which it was captured in advance.
  • The control unit 11 extracts feature points of the subjects captured in each frame included in the received moving image data and, while performing SLAM (Simultaneous Localization and Mapping) processing that associates the feature points with coordinates in the reconstruction space, which is a predetermined virtual three-dimensional space, searches the reference image data for reference feature points corresponding to those feature points.
  • SLAM Simultaneous Localization and Mapping
  • The control unit 11 then acquires the conversion relationship between the world coordinate system, which is the coordinate system of the position information, and the SLAM coordinate system (the coordinate system of these feature points, that is, the coordinate system of the reconstruction space) on the basis of the searched reference feature points, the position information associated with the reference image data, and the coordinates associated with the feature points.
  • the control unit 11 then converts the coordinates in the SLAM coordinate system associated with the feature points imaged in each frame into the values of the coordinates in the world coordinate system, using this conversion relationship.
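  • As a non-limiting illustration, once such a conversion relationship has been obtained as a similarity transformation (scale s, rotation R, translation t), applying it to SLAM coordinates could be sketched in Python as follows; the numeric values and variable names are illustrative assumptions, not values from the embodiment.

      import numpy as np

      def apply_sim3(points_slam, s, R, t):
          # map Nx3 SLAM-coordinate points into the world coordinate system:
          # p_world = s * R @ p_slam + t
          return s * points_slam @ R.T + t

      # hypothetical conversion relationship C_SLAM-World
      s = 3.2                                    # metres per SLAM unit (illustrative)
      R = np.eye(3)                              # rotation aligning the SLAM axes with the world axes
      t = np.array([384000.0, 35.0, 3945000.0])  # offset in world coordinates (illustrative)

      feature_points_slam = np.array([[0.1, 0.0, 1.5],
                                      [0.4, 0.2, 2.0]])
      print(apply_sim3(feature_points_slam, s, R, t))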
  • the world coordinate system is represented by a three-dimensional coordinate system (x, y, z)
  • the (x, z) plane corresponds to the Universal Transverse Mercator (UTM) coordinate system, which is an orthogonal plane coordinate system expressed in meters.
  • the y-axis corresponds to the altitude from the ground plane (also in meters).
  • values on the UTM Cartesian coordinate system can be converted to coordinate values of latitude and longitude pairs.
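  • For example, such a conversion between UTM coordinate values and latitude/longitude pairs could be performed with the pyproj library as in the following sketch; the choice of UTM zone 54N (EPSG:32654) and the coordinate values are assumptions made only for illustration.

      from pyproj import Transformer  # pip install pyproj

      # UTM zone 54N to WGS84 latitude/longitude
      utm_to_latlon = Transformer.from_crs("EPSG:32654", "EPSG:4326", always_xy=True)

      easting, northing = 384000.0, 3945000.0   # (x, z) values in metres (illustrative)
      lon, lat = utm_to_latlon.transform(easting, northing)
      print(lat, lon)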
  • In the present embodiment, when acquiring the conversion relationship between the world coordinate system, which is the coordinate system of the position information, and the SLAM coordinate system, the control unit 11 performs a correction that rescales the movement amount of the camera that captured the moving image data while suppressing changes in the estimated position and orientation represented by coordinates in the SLAM coordinate system, and a correction that brings the position and orientation of the camera close to the position information associated with the reference image data. The control unit 11 then acquires the conversion relationship between the world coordinate system and the SLAM coordinate system using the coordinates of the feature points represented in the corrected SLAM coordinate system and the values of the coordinates of the corresponding reference feature points in the world coordinate system.
  • the detailed operation of the control unit 11 will be described later.
  • the storage unit 12 is a memory device, a disk device, or the like, and holds a program executed by the control unit 11.
  • the storage unit 12 also operates as a work memory of the control unit 11.
  • the operation unit 13 is a keyboard, a mouse, or the like, receives an instruction operation of the user, and outputs the content of the instruction operation to the control unit 11.
  • the display unit 14 is a display or the like, and displays and outputs information in accordance with an instruction input from the control unit 11.
  • the communication unit 15 is a network interface or the like, and transmits information such as a request via the network in accordance with an instruction input from the control unit 11.
  • the communication unit 15 also outputs the information received via the network to the control unit 11.
  • the communication unit 15 is used, for example, when acquiring a reference image such as a geotag image from a server on the Internet.
  • the interface unit 16 is, for example, a USB interface or the like, and outputs moving image data input from a camera or the like to the control unit 11.
  • Functionally, as illustrated in FIG. 2, the control unit 11 includes a receiving unit 21, a SLAM processing unit 22, a reference image data acquisition unit 23, a search unit 24, a conversion relationship acquisition unit 25, and a conversion processing unit 26.
  • the receiving unit 21 receives moving image data including a series of frames obtained by capturing an object on a moving path with the camera while moving the camera.
  • This camera may be a monocular camera (that is, a camera that does not acquire information in the depth direction), and therefore, it is assumed that frames included in captured moving image data do not include depth information.
  • The control unit 11 reconstructs a three-dimensional map by SLAM using the received moving image data, and deforms the obtained three-dimensional reconstruction map into the world coordinate system.
  • In this way, position information is associated with all frames of the moving image data, including frames obtained by imaging places that cannot be directly associated with an image, such as a geotagged image, that carries position information in the world coordinate system.
  • In addition, the mapping to the world coordinate system is performed while improving scale drift, that is, while taking into account the problem that scale errors gradually accumulate during the reconstruction processing (the scale drift problem).
  • The SLAM processing unit 22 executes processing (SLAM processing) that extracts feature points of the subjects captured in each frame included in the moving image data and associates the feature points with coordinates in the reconstructed three-dimensional space. The SLAM processing unit 22 then issues unique feature point identification information for each feature point, and stores in the storage unit 12 the feature point identification information, information identifying the frame from which the feature point was extracted, and the coordinates of the feature point in the three-dimensional space in association with one another. This process is described, for example, in ORB-SLAM (R. Mur-Artal et al., "ORB-SLAM: a versatile and accurate monocular SLAM system," IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147-1163, 2015), and since the specific processing is widely known, its description is omitted here.
  • The SLAM processing unit 22 holds SLAM map information in which the feature point identification information generated here is associated with coordinate values in the coordinate system of the virtual three-dimensional space (the SLAM coordinate system).
  • For each key frame (KF), the SLAM processing unit 22 also generates information Cfp-kf in which the feature points observed in that key frame are associated with the values of their three-dimensional coordinates in the SLAM coordinate system of the SLAM map.
  • The SLAM processing unit 22 stores the information Cfp-kf in the storage unit 12.
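  • As a minimal illustration of how such information might be held in memory, the following Python sketch uses dataclasses; the field names and the representation of Cfp-kf as per-key-frame observations are assumptions made for clarity, not structures prescribed by the embodiment.

      from dataclasses import dataclass, field
      from typing import Dict, List, Tuple

      @dataclass
      class MapPoint:                               # one entry of the SLAM map
          point_id: int                             # feature point identification information
          xyz_slam: Tuple[float, float, float]      # 3-D coordinates in the SLAM coordinate system

      @dataclass
      class KeyFrame:                               # one key frame (KF)
          frame_id: int
          # Cfp-kf: feature point id -> 2-D position observed in this key frame
          observations: Dict[int, Tuple[float, float]] = field(default_factory=dict)

      slam_map: Dict[int, MapPoint] = {}
      key_frames: List[KeyFrame] = []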
  • The reference image data acquisition unit 23 also selects at least one of the frames included in the received moving image data as a target frame. This selection may be performed manually, or the frames selected as key frames in the three-dimensional reconstruction processing of the SLAM processing unit 22 may be used as target frames as they are.
  • The reference image data acquisition unit 23 acquires reference image data in which at least one subject common to the subjects captured in the selected target frame is captured, the reference image data being associated with position information in world coordinates representing the position at which it was captured in advance.
  • moving image data to be received is captured by a camera mounted on a vehicle or the like moving along a road.
  • the reference image data can be searched, for example, from Google Street View (https://www.google.com/streetview/) of Google Inc. in the United States.
  • Google Street View is a searchable GIS (Geographic Information System) for the streets, and is one of the large geotag image datasets for countries around the world.
  • GIS Geographic Information System
  • all geotag images are given as information in which a panoramic image and latitude and longitude information are associated.
  • the panoramic image is cut out in eight horizontal directions at the same angle of view as the target frame selected from the received moving image data, and used as a geotag image group.
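  • As a non-limiting sketch, cutting out such perspective views in eight horizontal directions from an equirectangular panorama could be done as follows; the projection conventions (zero-yaw direction, vertical orientation) are assumptions and may need to be adapted to the panorama format actually used.

      import numpy as np

      def cut_perspective(pano, yaw_deg, fov_deg=90.0, out_w=640, out_h=480):
          # sample a pinhole view looking in direction yaw_deg from an
          # equirectangular panorama given as an H x W x 3 numpy array
          H, W = pano.shape[:2]
          f = 0.5 * out_w / np.tan(np.radians(fov_deg) / 2.0)   # focal length in pixels
          u, v = np.meshgrid(np.arange(out_w), np.arange(out_h))
          x = (u - out_w / 2.0) / f
          y = (v - out_h / 2.0) / f
          z = np.ones_like(x)
          yaw = np.radians(yaw_deg)
          xr = x * np.cos(yaw) + z * np.sin(yaw)                # rotate rays about the vertical axis
          zr = -x * np.sin(yaw) + z * np.cos(yaw)
          lon = np.arctan2(xr, zr)
          lat = np.arctan2(y, np.sqrt(xr ** 2 + zr ** 2))
          px = ((lon / np.pi + 1.0) * 0.5 * (W - 1)).astype(int)
          py = ((lat / (np.pi / 2) + 1.0) * 0.5 * (H - 1)).astype(int)
          return pano[py, px]

      def cut_eight_views(pano, fov_deg=90.0):
          # eight horizontal directions, 45 degrees apart
          return [cut_perspective(pano, yaw, fov_deg) for yaw in range(0, 360, 45)]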
  • the reference image data does not have to be acquired from Google Street View.
  • For example, when the route includes an intersection, image data captured separately (other than the image data extracted from the moving image data) and associated with the position information (latitude and longitude information) of the intersection can be used as a geotagged image.
  • the moving route of the moving image data to be accepted may not necessarily be outdoors, and may be, for example, a route moving in a facility such as a store.
  • In this case, image data captured separately (apart from the moving image data) on the movement path, associated with coordinate values in a world coordinate system set appropriately within the facility, is used as the reference image data.
  • The world coordinate system in this case may be, for example, an orthogonal coordinate system in which, when the facility is viewed from above, a specific point in the facility is taken as the origin, the positive direction of the X axis points east and the positive direction of the Y axis points north, and the (X, Y) values are expressed in meters.
  • The reference image data acquisition unit 23 receives, for example, an input of information specifying the area in which the received moving image data was captured (for example, a city name, a town name, or a range within a specified distance (for example, 400 meters) from a specified point), receives from the Google Street View server the geotagged images associated with latitude and longitude information within the input area, and cuts out each of the geotagged images in eight horizontal directions at the same angle of view as the target frame selected from the received moving image data to form a geotagged image group.
  • the reference image data acquisition unit 23 selects k geotag images similar to the selected target frame in descending order of the degree of similarity as reference image data.
  • The degree of similarity may be computed by, for example, a bag-of-words approach using SIFT feature quantities (Agarwal, W. Burgard, and L. Spinello, "Metric localization using google street view," IROS, pp. 3111-3118, 2015).
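  • A minimal sketch of such a bag-of-words similarity ranking is shown below, assuming OpenCV's SIFT implementation and a k-means vocabulary; the vocabulary size and scoring are illustrative choices rather than those of the cited method.

      import cv2
      import numpy as np
      from sklearn.cluster import KMeans

      def sift_descriptors(img):
          gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
          _, desc = cv2.SIFT_create().detectAndCompute(gray, None)
          return desc if desc is not None else np.zeros((0, 128), np.float32)

      def build_vocabulary(images, n_words=256):
          all_desc = np.vstack([sift_descriptors(im) for im in images])
          return KMeans(n_clusters=n_words, n_init=10).fit(all_desc)

      def bow_histogram(img, vocab):
          desc = sift_descriptors(img)
          words = vocab.predict(desc) if len(desc) else np.empty(0, dtype=int)
          hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
          return hist / (np.linalg.norm(hist) + 1e-9)

      def top_k_similar(target_frame, geotag_images, vocab, k=5):
          # return indices of the k geotagged images most similar to the target frame
          q = bow_histogram(target_frame, vocab)
          sims = [float(q @ bow_histogram(g, vocab)) for g in geotag_images]
          return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]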
  • Alternatively, the reference image data acquisition unit 23 may display and output the selected target frame, allow the user to select, for example from Google Street View, a geotagged image in the vicinity of the displayed target frame, and acquire the selected geotagged image as reference image data.
  • The search unit 24 acquires sets of corresponding feature points from the target frame selected by the reference image data acquisition unit 23 and from the reference image data acquired by the reference image data acquisition unit 23. Specifically, the search unit 24 detects ORB feature points (E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," ICCV, pp. 2564-2571, 2011) from each of the target frame and the reference image data, and then matches the feature points detected from the target frame against those detected from the reference image data.
  • At this time, the search unit 24 may remove incorrectly matched pairs by using a VLD (Virtual Line Descriptor) or the like.
  • VLD Virtual Line Descriptor
  • the details of VLD are disclosed in Z. Liu and R. Marlet, “Virtual line descriptor and semilocal matching method for reliable feature correspondence,” BMVC, pp. 16-1, 2012, and are widely known. Detailed explanation here is omitted.
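  • As an illustrative sketch of the ORB detection and matching step (using a ratio test and a RANSAC fundamental-matrix check in place of VLD, which is not part of OpenCV), the following Python code could be used; the thresholds are assumptions.

      import cv2
      import numpy as np

      def match_orb(target_frame, ref_img, n_features=2000):
          orb = cv2.ORB_create(nfeatures=n_features)
          kp1, des1 = orb.detectAndCompute(target_frame, None)
          kp2, des2 = orb.detectAndCompute(ref_img, None)
          if des1 is None or des2 is None:
              return [], kp1, kp2
          knn = cv2.BFMatcher(cv2.NORM_HAMMING).knnMatch(des1, des2, k=2)
          # Lowe ratio test to discard ambiguous matches
          good = [m[0] for m in knn if len(m) == 2 and m[0].distance < 0.75 * m[1].distance]
          if len(good) >= 8:
              pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
              pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
              _, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)
              if mask is not None:
                  good = [m for m, keep in zip(good, mask.ravel()) if keep]
          return good, kp1, kp2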
  • A plurality of reference image data may be obtained for one target frame (for example, when k images are selected in descending order of similarity as described above).
  • If the number of matched reference feature points found for a target frame does not reach a predetermined threshold (for example, 5), the search unit 24 excludes that target frame from selection (does not set it as a target frame).
  • The conversion relationship acquisition unit 25 estimates a conversion relationship CSLAM-World between the SLAM coordinate system and the world coordinate system. Specifically, the conversion relationship acquisition unit 25 first estimates the pose of the camera that captured the reference image data. That is, for each feature point in the SLAM map for which the search unit 24 has found a corresponding reference feature point in the reference image data, the conversion relationship acquisition unit 25 obtains from the SLAM map the value of the coordinates of that feature point in the SLAM coordinate system. In addition, the conversion relationship acquisition unit 25 receives from the search unit 24 the information on the position of the reference feature point in the reference image data corresponding to each feature point (a two-dimensional position within the reference image data), and obtains information Cmap-geo in which the value of each feature point in the SLAM coordinate system is associated with the position of the corresponding reference feature point in the reference image data.
  • The conversion relationship acquisition unit 25 obtains the pose information (six degrees of freedom) of the camera that captured the reference image data, expressed in the SLAM coordinate system, by minimizing the reprojection error obtained when Cmap-geo is reprojected onto the reference image data. This minimization is performed, for example, using the Levenberg-Marquardt method.
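  • For illustration, this 3D-2D pose estimation could be sketched with OpenCV's PnP solvers, using RANSAC followed by Levenberg-Marquardt refinement as a stand-in for the minimization described above; the intrinsic matrix K and the absence of lens distortion are assumptions.

      import cv2
      import numpy as np

      def estimate_ref_camera_pose(pts3d_slam, pts2d_ref, K):
          # pts3d_slam: feature point coordinates in the SLAM map (N x 3)
          # pts2d_ref:  corresponding reference feature points in the reference image (N x 2)
          obj = np.asarray(pts3d_slam, np.float32)
          img = np.asarray(pts2d_ref, np.float32)
          ok, rvec, tvec, inliers = cv2.solvePnPRansac(obj, img, K, None,
                                                       flags=cv2.SOLVEPNP_ITERATIVE)
          if not ok or inliers is None:
              return None
          # Levenberg-Marquardt refinement on the inlier correspondences
          rvec, tvec = cv2.solvePnPRefineLM(obj[inliers.ravel()], img[inliers.ravel()],
                                            K, None, rvec, tvec)
          R, _ = cv2.Rodrigues(rvec)
          return R, tvec   # pose of the reference-image camera in the SLAM coordinate system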
  • The conversion relationship acquisition unit 25 thus obtains a set consisting of the pose of the camera that captured the reference image data in the SLAM coordinate system and the world coordinates associated with that reference image data, and, on the basis of these, obtains the transformation relationship CSLAM-World between the SLAM coordinate system and the world coordinate system by a widely known method.
  • the transformation processing unit 26 transforms the SLAM map using the transformation relationship CSLAM-World between the SLAM coordinate system and the world coordinate system.
  • the conversion processing unit 26 sequentially performs processing of initialization processing, pose graph optimization, and bundle adjustment.
  • The initialization process is a process that roughly aligns the entire SLAM map obtained up to that point with the world coordinate system by sequentially performing the following linear transformations, using the correspondence relationship CSLAM-World between the SLAM coordinate system of the SLAM map and the world coordinate system.
  • This initialization process is performed only at the beginning of the process and at a timing at which a predetermined condition is satisfied.
  • The initialization process is first performed at the timing when the search unit 24 obtains a pair of a target frame and reference image data in which a number of corresponding feature points equal to or greater than the predetermined threshold is found.
  • Thereafter, for the i-th such pair (i being an integer of i ≥ 2), the initialization process is performed again when the distance between the estimated camera position and the position information of the reference image data exceeds a predetermined threshold (for example, 10 m).
  • the conversion processing unit 26 does not perform pose graph optimization and bundle adjustment processing until the first initialization is performed.
  • As the first linear transformation in the initialization process, the conversion processing unit 26 assumes that the camera capturing the moving image data moves on a single plane and rotates the SLAM map so that this plane coincides with the xz plane.
  • The camera plane used at this time is estimated by principal component analysis of all the camera positions determined so far.
  • As the second linear transformation, a similarity transformation is performed that brings each point p in the SLAM coordinate system contained in the first to i-th correspondences CSLAM-World closer to the corresponding point pworld in the world coordinate system, that is, a similarity transformation that minimizes the differences between the transformed points and the points pworld (equation (1)).
  • By these transformations, the poses of the cameras that captured the target frames and the reference image data, and the positions of the feature points in the SLAM map, are transformed.
  • The first and second linear transformations here are both three-dimensional similarity transformations, and these transformations do not improve the scale drift.
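  • A minimal sketch of such an initialization, combining a principal-component plane alignment with an Umeyama-style similarity fit, is shown below; it is one possible realization under the stated assumption that the camera moves on a plane, not the exact computation of the embodiment.

      import numpy as np

      def initialize_alignment(cam_positions_slam, anchor_slam, anchor_world):
          # 1) rotate the SLAM map so that the camera plane (from PCA) becomes the xz plane
          centered = cam_positions_slam - cam_positions_slam.mean(axis=0)
          _, _, vt = np.linalg.svd(centered, full_matrices=False)
          normal = vt[-1]                               # least-variance direction of camera positions
          y = np.array([0.0, 1.0, 0.0])
          v, c = np.cross(normal, y), float(normal @ y)
          vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
          R_plane = np.eye(3) + vx + vx @ vx / (1.0 + c + 1e-12)   # rotation taking normal to y
          # 2) similarity transform minimising sum ||s R p + t - p_world||^2 (Umeyama)
          src = anchor_slam @ R_plane.T
          mu_s, mu_w = src.mean(axis=0), anchor_world.mean(axis=0)
          cov = (anchor_world - mu_w).T @ (src - mu_s) / len(src)
          U, D, Vt = np.linalg.svd(cov)
          S = np.eye(3)
          S[2, 2] = np.sign(np.linalg.det(U @ Vt))
          R_sim = U @ S @ Vt
          var_src = ((src - mu_s) ** 2).sum() / len(src)
          s = np.trace(np.diag(D) @ S) / var_src
          t = mu_w - s * R_sim @ mu_s
          return s, R_sim @ R_plane, t                  # combined scale, rotation, translation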
  • the conversion processing unit 26 performs scale drift improvement processing by pose graph optimization each time position estimation is newly performed using the i-th reference image data.
  • As illustrated in FIG. 3, the pose graph used here includes nodes Sn representing the pose information of the camera that captured the frames, nodes Sm representing the pose information of the camera that captured the reference image data, and nodes fp representing the position information associated with the reference image data.
  • Adjacent nodes Sn (temporally adjacent in the moving image data), each representing the pose of the camera that captured a frame, are connected to each other by a first constraint e1 relating to pose. That is, the relative transformation between the camera poses of adjacent frames is constrained by e1.
  • Further, the node Sn corresponding to a target frame and the node Sm representing the pose of the camera that captured the reference image data containing feature points corresponding to the feature points of that target frame are connected to each other by a second constraint e2. That is, the relative transformation between the camera pose of the target frame and the camera pose of the corresponding reference image data is constrained by e2.
  • In addition, the node Sm representing the pose of the camera that captured the reference image data and the node fp representing the position information associated with that reference image data are connected to each other by a third constraint e3 relating to distance.
  • Using this pose graph, the conversion processing unit 26 corrects the information on the movement path of the camera that captured the frames while suppressing changes in the camera pose at each node and changes in the moving direction of the camera, but allowing changes in scale. At this time, the conversion processing unit 26 performs a correction that brings the estimated position of the reference image data associated with each target frame (the target frames being some of the frames) close to the position in the world coordinate system originally associated with that reference image data.
  • the scale drift is improved by the constrained pose graph optimization in the three-dimensional similarity transformation group Sim (3).
  • In pose graph optimization, the camera poses are taken as optimization variables, and optimization is performed under constraints on the relative transformations between camera poses. That is, in the present embodiment, the pose graph is used to perform a nonlinear deformation that takes scale drift into account.
  • The three-dimensional rigid transformation G belonging to the special Euclidean group SE(3) is defined by the following equation (2):
  • G = [[R, t], [0, 1]], with R ∈ SO(3) and t ∈ R^3 ... (2)
  • Here, R is a rotation matrix, t is a translation vector (a vector quantity printed in bold in the original, written as t in this description for convenience) whose components are three-dimensional real values, and s below is a non-negative real value.
  • The conversion from SE(3) to Sim(3) is performed by setting the scale component s to 1 without changing the rotation matrix R or the translation vector t. That is, the three-dimensional similarity transformation S (S belonging to Sim(3)) is S = [[sR, t], [0, 1]].
  • SO(3), SE(3), and Sim(3) are all Lie groups; each element of the corresponding Lie algebra is mapped to the group by the exponential map, and the inverse mapping, the logarithmic map, is also defined.
  • Lie algebra is represented by a vector notation of coefficients.
  • The Lie algebra corresponding to Sim(3) is written as a seven-dimensional (seven-degree-of-freedom) vector obtained by adding a scale component to the six-dimensional (six-degree-of-freedom) vector representing the pose of a camera, and its exponential map is defined so that a coefficient vector ξ = (ν, ω, σ)^T is mapped to the matrix S = [[exp(σ)·exp([ω]×), Wν], [0, 1]].
  • The superscript T of a vector or matrix denotes transposition (the same applies hereinafter).
  • W is a term similar to Rodrigues' formula.
  • In the present embodiment, a cost function associated with the constrained deformation of the pose graph is defined, and this cost function is minimized by the Levenberg-Marquardt method on Lie groups. In this way, while the structure of the original SLAM map is maintained, the process of improving the scale drift and the process of bringing corresponding points in the two coordinate systems, the SLAM coordinate system and the world coordinate system, close to each other are performed at one time.
  • The corresponding logarithmic map yields a seven-dimensional real vector.
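  • As a numerical illustration (not the closed-form expression with W referred to above), the exponential and logarithmic maps between Sim(3) and its seven-dimensional Lie algebra can be sketched with the general matrix exponential and logarithm as follows.

      import numpy as np
      from scipy.linalg import expm, logm

      def sim3_matrix(s, R, t):
          # [[sR, t], [0, 1]] representation of a Sim(3) element
          M = np.eye(4)
          M[:3, :3] = s * R
          M[:3, 3] = t
          return M

      def sim3_exp(xi):
          # xi = (nu, omega, sigma): 7-dimensional Lie-algebra coefficient vector
          nu, omega, sigma = xi[:3], xi[3:6], xi[6]
          G = np.zeros((4, 4))
          G[:3, :3] = sigma * np.eye(3) + np.array([[0, -omega[2], omega[1]],
                                                    [omega[2], 0, -omega[0]],
                                                    [-omega[1], omega[0], 0]])
          G[:3, 3] = nu
          return expm(G)                        # exponential map: algebra -> group

      def sim3_log(M):
          # logarithmic map: group -> 7-dimensional coefficient vector
          G = logm(M).real
          sigma = np.trace(G[:3, :3]) / 3.0
          skew = 0.5 * (G[:3, :3] - G[:3, :3].T)
          omega = np.array([skew[2, 1], skew[0, 2], skew[1, 0]])
          return np.concatenate([G[:3, 3], omega, [sigma]])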
  • As the third constraint e3 relating to the distance between the node Sm representing the pose of the camera that captured the reference image data and the node fp representing the position information ym associated with that reference image data, the difference between the camera position represented by Sm and the position ym is used.
  • The minimization of e1i,j and e2k,l serves to suppress changes in the relative transformations between camera poses, apart from gradual scale changes. The minimization of e3m serves to bring the camera position of the reference image data closer to the position in the world coordinate system associated with that reference image data.
  • The pose graph used by the position estimation device 1 is as follows.
  • Node Sn: pose of the camera when capturing the n-th key frame; Sn ∈ Sim(3), n ∈ {1, 2, ..., N}.
  • Node Sm: pose of the camera when capturing the m-th reference image data; Sm ∈ Sim(3), m ∈ {1, 2, ..., M}.
  • Edge e1i,j: constraint due to the relative transformation between the camera poses when capturing the i-th and j-th key frames; (i, j) ∈ C1.
  • Edge e2k,l: constraint due to the relative transformation between the camera pose when capturing the k-th target frame and the camera pose when capturing the corresponding l-th reference image data; (k, l) ∈ C2.
  • N is the total number of key frames
  • M is the number of target frames (the total number of reference image data associated with the target frames)
  • C 1 is a set of key frames in which the same feature points in the SLAM map are captured
  • C2 represents a set of the target frame and the corresponding reference image data.
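  • A schematic Python sketch of the residuals corresponding to the edges e1, e2, and e3 is given below, building on the sim3_log helper sketched above; the pose convention (poses taken as world-to-camera transforms) and the equal weighting of the terms are assumptions, and a Levenberg-Marquardt solver such as scipy.optimize.least_squares could minimize the stacked residuals over the node poses Sn and Sm.

      import numpy as np
      # sim3_log is the helper from the Sim(3) sketch above

      def e1_residual(S_i, S_j, S_i0, S_j0):
          # relative transform between adjacent key-frame poses should stay close
          # to the relative transform originally measured by SLAM
          rel_now = np.linalg.inv(S_i) @ S_j
          rel_ref = np.linalg.inv(S_i0) @ S_j0
          return sim3_log(np.linalg.inv(rel_ref) @ rel_now)     # 7-vector residual

      def e2_residual(S_keyframe, S_refimage, rel_measured):
          # relative transform between a target frame and its reference image,
          # measured when estimating the reference-image camera pose from Cmap-geo
          rel_now = np.linalg.inv(S_keyframe) @ S_refimage
          return sim3_log(np.linalg.inv(rel_measured) @ rel_now)

      def e3_residual(S_refimage, y_world):
          # distance between the reference-image camera position and its geotag
          centre = np.linalg.inv(S_refimage)[:3, 3]             # camera centre under a world-to-camera pose
          return centre - y_world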
  • The conversion processing unit 26 executes the process of optimizing the pose graph defined as described above. That is, the conversion processing unit 26 extracts, from the frames of the received moving image data, at least one key frame group C1 (containing N key frames) in which common feature points are imaged. Then, referring to the set C2 (M pairs of a target frame and the corresponding reference image data) for which corresponding feature points were found by the search unit 24, the conversion processing unit 26 obtains the estimated pose information S1, S2, ... of the cameras by minimizing the cost function on the Lie manifold according to the Levenberg-Marquardt method.
  • The conversion processing unit 26 reflects the transformation obtained by this optimization in the positions of the feature points in the SLAM map as well. This reflection can be carried out using widely known methods such as the one used in H. Strasdat et al., "Scale drift-aware large scale monocular SLAM," Robotics: Science and Systems VI, 2010.
  • The pose graph of the present embodiment differs from the pose graph used by Strasdat et al. in that it adds the nodes Sm representing the poses of the cameras that captured the reference image data, the edges e2k,l representing constraints due to the relative transformation between the camera pose of the k-th target frame and that of the l-th reference image data, and the edges e3m representing the distance between the pose Sm of the camera that captured the reference image data and the position information ym in the world coordinate system associated with that reference image data. By these additions, the scale drift of the SLAM map is improved by bringing the camera positions of the reference image data, such as geotagged images, close not to a loop closure but to the values in the world coordinate system represented by the position information (the geotags).
  • When the frames of the received moving image data contain a key frame group C1 in which common feature points are imaged other than the key frame group C1 already subjected to the pose graph optimization process, the conversion processing unit 26 sequentially performs the pose graph optimization process on that key frame group C1 as well.
  • the conversion processing unit 26 further deforms the SLAM map by bundle adjustment (BA) including the constraints determined in relation to the reference image data.
  • Here, the conversion processing unit 26 performs the following bundle adjustment.
  • Specifically, the conversion processing unit 26 of the present embodiment minimizes the Cfp-kf reprojection error and the Cfp-geo reprojection error together.
  • In the bundle adjustment, the conversion processing unit 26 obtains the reprojection error ri,j between the i-th feature point and the j-th camera pose as ri,j = xi - π(Rj Xi + tj), where π is the projection π((X, Y, Z)) = (fx·X/Z + cx, fy·Y/Z + cy).
  • Here, Xi is the coordinate of the feature point in the SLAM coordinate system,
  • xi is the two-dimensional coordinate of the feature point within the frame,
  • Rj and tj represent the rotation and translation of the j-th camera pose,
  • (fx, fy) is the focal length, and
  • (cx, cy) represents the coordinates of the projection center.
  • the total cost function is defined as follows, including the restriction on the reference image data.
  • Tj is information on the posture of the camera that photographed the j-th key frame, and is expressed as an element of SE (3).
  • A Huber robust cost function is applied to the error terms.
  • C5 represents the feature points in the key frames of C1.
  • C1 and C3 are the sets already mentioned.
  • the feature points of the key frame and the information on the camera pose can be obtained by minimizing the total cost function on this Lie manifold. This minimization can be done using the Levenberg-Marquardt method.
  • In this bundle adjustment, not only the usual Cfp-kf reprojection error but also the Cfp-geo reprojection error is minimized, in order to incorporate the constraints from the position information associated with the reference image data.
  • Here, the pose information of the camera that captured the reference image data is kept fixed. This deformation can further reduce the scale drift of the SLAM map, particularly when a sufficiently good initial solution is provided.
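  • The following sketch illustrates a reprojection-error residual of this form and how it could be fed to a Levenberg-Marquardt solver; the parameter packing, the weighting, and the omission of the Huber robustification are simplifications made for illustration.

      import numpy as np
      from scipy.optimize import least_squares
      from scipy.spatial.transform import Rotation

      def project(X_cam, fx, fy, cx, cy):
          # pinhole projection of a camera-frame point onto the image plane
          return np.array([fx * X_cam[0] / X_cam[2] + cx,
                           fy * X_cam[1] / X_cam[2] + cy])

      def ba_residuals(params, observations, fx, fy, cx, cy, n_cams, n_pts):
          # observations: list of (camera index j, point index i, observed xy)
          # params packs the camera poses (rotation vector + translation, 6 values each)
          # followed by the 3-D feature point coordinates
          cams = params[:6 * n_cams].reshape(n_cams, 6)
          pts = params[6 * n_cams:].reshape(n_pts, 3)
          res = []
          for j, i, xy in observations:
              R = Rotation.from_rotvec(cams[j, :3]).as_matrix()
              res.append(xy - project(R @ pts[i] + cams[j, 3:], fx, fy, cx, cy))
          return np.concatenate(res)

      # result = least_squares(ba_residuals, x0, method="lm",
      #                        args=(obs, fx, fy, cx, cy, n_cams, n_pts))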
  • the position estimation device 1 of the present embodiment has the above configuration and operates as follows.
  • That is, the position estimation device 1 receives moving image data including a series of frames obtained by imaging subjects along a movement path while moving, and executes the process illustrated in FIG. 4 below.
  • The position estimation device 1 of the present embodiment extracts feature points of the subjects captured in each frame included in the moving image data, and generates the SLAM map, which is a reconstruction map in which the feature points are associated with coordinates in the reconstruction space, a predetermined virtual three-dimensional space (S1: SLAM processing).
  • the position estimation device 1 selects at least one of the frames included in the received moving image data as a target frame (S2), and repeats the following processing for each target frame.
  • the position estimation device 1 sets the selected target frames one by one in a predetermined order as the target frame.
  • The position estimation apparatus 1 acquires reference image data in which at least one subject imaged in the target frame is also imaged, the reference image data being associated with position information in a predetermined world coordinate system representing the position at which it was captured in advance (S3).
  • This acquisition process may be performed by requesting a search from a server on the network, or by prompting the user to input the reference image data.
  • the position estimation device 1 searches for the reference feature point corresponding to the feature point found from the target frame in the acquired reference image data (feature point matching: S4).
  • The position estimation device 1 checks whether the number of reference feature points found here is equal to or greater than a predetermined threshold (for example, five) (S5).
  • If the number of reference feature points found in step S5 is equal to or greater than the predetermined threshold (S5: Yes), the position estimation device 1 estimates the position and attitude of the camera that captured the reference image data acquired in step S3 (S6).
  • This processing can be performed using the coordinates (two-dimensional coordinates) of the feature points in the target frame, the coordinates (three-dimensional coordinates) of the feature points in the coordinate system of the SLAM map, the coordinates (two-dimensional coordinates) of the corresponding reference feature points in the reference image data, and the position information (three-dimensional coordinate values in the world coordinate system) associated with the reference image data.
  • If there remains a target frame that has not yet been processed, the position estimation device 1 then returns to step S3, sets the next target frame as the frame to be processed, and continues the processing.
  • If the processing for all the target frames has been completed, the process proceeds to the next processing (step S10 described below).
  • If the number of reference feature points found in step S5 is less than the threshold (S5: No), the process returns to step S3 without executing step S6, and the next target frame is set and the processing continues. Here too, if there is no target frame that has not yet been processed (if the processing for all the target frames is completed), the process proceeds to the next processing (step S10 described below).
  • In this way, the position estimation device 1 can obtain the transformation relationship between the world coordinate system, which is the coordinate system of the position information associated with each reference image data, and the coordinate system of the coordinate values of the feature points found from each target frame, that is, the coordinate system of the SLAM map.
  • the position estimation device 1 performs transformation processing of the SLAM map.
  • the position estimation device 1 first determines whether to perform an initialization process (S10).
  • S10 initialization process
  • For example, it may be determined that the initialization process is to be performed when the distance between the position previously estimated as the camera position (this estimation will be described later) and the position information associated with the corresponding reference image data exceeds a predetermined threshold (for example, 10 m).
  • When initialization is to be performed, the position estimation apparatus 1 executes the initialization process (S11). First, assuming that the camera capturing the moving image data moves on a single plane, the position estimation device 1 performs principal component analysis on the positions of the camera (the camera that captured the received moving image data) in the SLAM map obtained so far, determines the coordinate axes of the plane on which the camera is assumed to lie (for example, the axis of the normal to the plane), and rotates the SLAM map so that this normal becomes parallel to the y axis (so that the plane on which the camera lies becomes the xz plane).
  • Next, using the frames and reference image data obtained so far and the position information (in the world coordinate system) associated with the reference image data, the position estimation device 1 scales the SLAM coordinate system (a similarity transformation) so that the sum (or the sum of squares) of the absolute values of the differences between the coordinate values of the feature points in the SLAM coordinate system and the values of the corresponding reference feature points in the world coordinate system becomes minimal.
  • Subsequently, the position estimation device 1 optimizes the constrained pose graph in Sim(3) composed of the following elements (S12):
  • Node Sn: attitude of the camera when capturing the n-th key frame; Sn ∈ Sim(3), n ∈ {1, 2, ..., N}.
  • Node Sm: pose of the camera when imaging the m-th reference image data; Sm ∈ Sim(3), m ∈ {1, 2, ..., M}.
  • Edge e1i,j: constraint due to the relative transformation between camera poses when capturing the i-th and j-th key frames; (i, j) ∈ C1.
  • Edge e2k,l: constraint due to the relative transformation between the camera pose when capturing the k-th target frame and the camera pose when capturing the corresponding l-th reference image data; (k, l) ∈ C2.
  • Edge e3m: distance between the pose Sm of the camera that captured the reference image data and the position information ym in the world coordinate system associated with the reference image data; m ∈ {1, 2, ..., M}.
  • That is, an optimization process is performed using a cost function that includes a cost value relating to changes in the estimated position and orientation of the camera that captured the moving image data, represented by SLAM coordinates in the reconstruction space (the SLAM map), a cost value relating to the scaling of the movement amount of the camera, and a cost value relating to the distance between the position of the camera and the position information associated with the reference image data. In this way, the correction that rescales the movement amount of the camera while suppressing changes in the estimated position and orientation represented by the SLAM coordinates, and the correction that brings the position and orientation of the camera close to the position information associated with the corresponding reference image data, are performed at one time.
  • the position estimation device 1 also performs bundle adjustment processing to correct the reprojection error of the feature point to the key frame and the reprojection error of the reference feature point to the reference image data (S13). As a result, an estimation result of the position and orientation represented by the world coordinate system in each target frame of the camera that has captured the moving image data can be obtained.
  • the position estimation device 1 outputs the information (result of estimation) of the position and orientation of the camera in the world coordinate system obtained in the processing up to this point (S14), and ends the processing.
  • The position estimation device 1 has been described above as receiving moving image data whose capture has already been completed, but the present embodiment is not limited to this.
  • The moving image data including a series of frames obtained by imaging subjects along a movement path while moving may instead be received sequentially, each time a part of the frames (for example, one frame) is captured.
  • In this case, the SLAM map is generated by sequential processing (that is, feature points of the subjects captured in the sequentially received frames are extracted, and the feature points are associated with coordinates in the reconstruction space, which is a predetermined virtual three-dimensional space).
  • Such SLAM map generation processing is widely known as incremental SfM (Structure from Motion) or the like, and thus detailed description thereof is omitted here.
  • In this case, the position estimation device 1 also performs the subsequent processing sequentially. That is, the position estimation device 1 determines whether or not a received frame is to be used as a key frame, and when it is used as a key frame, selects that key frame as a target frame and performs the processing of step S3 and the subsequent steps in FIG. 4.
  • When the difference between the camera position estimated for a certain frame in steps S12 and S13 of FIG. 4 and the position represented by the position information associated with the reference image data corresponding to that frame exceeds a predetermined threshold, the process may return to step S10 and the initialization process may be performed again.
  • To quantitatively evaluate the accuracy of the method of the embodiment, in which the position estimation device 1 according to the embodiment of the present invention described above is implemented on a computer, position estimation methods that use reference image data (in the following example, geotagged images) but differ from the embodiment of the present invention are compared with the position estimation method according to the embodiment of the present invention.
  • The deformation process of the embodiment consists of three stages: (1) the linear transformation of the initialization process (ILT), (2) the optimization of the constrained pose graph on Sim(3) (PGO), and (3) the bundle adjustment (BA) that corrects the reprojection error onto the key frames and the reprojection error onto the reference image data. The impact of each of the processes (1) to (3) on accuracy and its usefulness are examined.
  • The Málaga Stereo and Laser Urban Data Set (Málaga data set) is used.
  • The video of this Málaga data set has a resolution of 1024 × 768 and a frame rate of 20 fps.
  • two types of video (Video 1 and 2) are cut out from the video and used for evaluation.
  • The paths traveled in the two videos do not form a loop, and each path is 1 km or longer.
  • Although all frames are associated with GPS position information acquired once per second, some frames include errors of 10 m or more.
  • In the comparison methods, the same SLAM as in the present embodiment is used, the portion that acquires the correspondence between the world coordinate system and the SLAM coordinate system (the reconstructed three-dimensional coordinate system) is replaced with the same processing as in the present embodiment, and only the processing for deforming the SLAM map is compared.
  • the Kroeger method uses spline smoothing to smooth the camera pose.
  • That is, the camera poses of the geotagged images (reference image data) corresponding to the key frames are smoothed (interpolated) using a cubic B-spline.
  • Chang's method performs a two-dimensional affine transformation on the xz plane so that the SLAM map reconstructed from the input moving image data matches the 3D point group acquired from Google Street View, and performs a scale conversion in the y-axis direction.
  • In addition, a method that applies the same transformation as Chang's method to the SLAM map (Affine+) was used for comparison.
  • The method of the present embodiment is closer to the ground truth (GT) than the other methods, and achieves a significant improvement in accuracy compared with them.
  • Because the Kroeger method interpolates the correspondence between sparse geotags and images without considering the three-dimensional structure, large errors occur in position estimation where there are not enough corresponding points.
  • Chang's method uses the three-dimensional structure of the images but applies only a simple linear transformation, so it is strongly affected by distortion due to scale drift; in some cases the accuracy is not sufficiently high, and in other cases the position cannot be estimated.
  • As described above, in the present embodiment, the absolute position of the camera at the time each frame of the moving image data was captured (the position in the world coordinate system, that is, information corresponding to latitude and longitude) was estimated from the sparse correspondence between geotagged images and the captured moving image data, using the result of three-dimensional reconstruction (SLAM).
  • SLAM three-dimensional reconstruction
  • By integrating the processing for improving scale drift with the sparse correspondence (association of only some of the frames) between geotagged images and the captured moving image data, it has become possible to estimate the absolute position while making proper use of the three-dimensionally reconstructed structure.
  • the accuracy is considered to be improved as the error included in the position information such as the latitude and longitude associated with the reference image data is smaller.
  • When the feature points in the frames of the captured moving image data and the reference feature points in the corresponding reference image data are located far from the positions at which the respective images were captured, or when there are only a few of them, the matching accuracy between feature points may not be sufficient.
  • In addition, when the camera points in a direction parallel to the roadway (the movement trajectory), or when the angle of view of the camera is narrow, the error in the direction from the camera toward the feature point group becomes relatively large when estimating the position and orientation of the camera that captured the reference image data.
  • This error mainly occurs in the direction parallel to the roadway.
  • Reference Signs List: 1 position estimation apparatus, 11 control unit, 12 storage unit, 13 operation unit, 14 display unit, 15 communication unit, 16 interface unit, 21 reception unit, 22 SLAM processing unit, 23 reference image data acquisition unit, 24 search unit, 25 conversion relationship acquisition unit, 26 conversion processing unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a position estimation device that accepts moving image data comprising a series of frames obtained by capturing images of subjects along a movement path while moving, extracts feature points of the subjects captured in each frame included in the moving image data, and creates a reconstruction map in which the feature points are associated with coordinates in a reconstruction space, which is a prescribed virtual three-dimensional space. The position estimation device searches reference image data associated with position information for reference feature points corresponding to the feature points, acquires a transformation relationship between a world coordinate system, which is the coordinate system of the position information, and the coordinate system of the coordinates associated with the feature points, and uses the transformation relationship to correct the reconstruction map by performing a correction that rescales the movement amount of the camera that captured the moving image data while suppressing changes in the estimated position and attitude of the camera, represented by the coordinates in the reconstruction space, and by performing a correction such that the position and attitude of the camera approach the position information associated with the reference image data.
PCT/JP2018/023697 2017-06-21 2018-06-21 Dispositif, procédé et programme d'estimation de position WO2018235923A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201762522857P 2017-06-21 2017-06-21
US62/522,857 2017-06-21

Publications (1)

Publication Number Publication Date
WO2018235923A1 true WO2018235923A1 (fr) 2018-12-27

Family

ID=64737652

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/023697 WO2018235923A1 (fr) 2017-06-21 2018-06-21 Dispositif, procédé et programme d'estimation de position

Country Status (1)

Country Link
WO (1) WO2018235923A1 (fr)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110106755A (zh) * 2019-04-04 2019-08-09 武汉大学 利用姿态重构铁轨几何形态的高铁轨道不平顺性检测方法
CN110570473A (zh) * 2019-09-12 2019-12-13 河北工业大学 一种基于点线融合的权重自适应位姿估计方法
CN111882494A (zh) * 2020-06-28 2020-11-03 广州文远知行科技有限公司 位姿图处理方法、装置、计算机设备和存储介质
CN112819970A (zh) * 2021-02-19 2021-05-18 联想(北京)有限公司 一种控制方法、装置及电子设备
WO2021100650A1 (fr) * 2019-11-22 2021-05-27 パナソニックIpマネジメント株式会社 Dispositif d'estimation de position, véhicule, procédé d'estimation de position et programme d'estimation de position
CN113781550A (zh) * 2021-08-10 2021-12-10 国网河北省电力有限公司保定供电分公司 一种四足机器人定位方法与系统
CN113779012A (zh) * 2021-09-16 2021-12-10 中国电子科技集团公司第五十四研究所 一种用于无人机的单目视觉slam尺度恢复方法
WO2022118061A1 (fr) * 2020-12-04 2022-06-09 Hinge Health, Inc. Localisations tridimensionnelles d'objets dans des images ou des vidéos
CN115700507A (zh) * 2021-07-30 2023-02-07 北京小米移动软件有限公司 地图更新方法及其装置
JP2023529786A (ja) * 2021-05-07 2023-07-12 テンセント・アメリカ・エルエルシー パノラマ画像におけるグラウンド上のマーカを認識することによってカメラ間のポーズグラフおよび変換マトリクスを推定する方法

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013535013A (ja) * 2010-06-25 2013-09-09 トリンブル ナビゲーション リミテッド 画像ベースの測位のための方法および装置

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013535013A (ja) * 2010-06-25 2013-09-09 トリンブル ナビゲーション リミテッド 画像ベースの測位のための方法および装置

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHANG, SHAOPIN ET AL.: "Extracting Driving Behavior: Global Metric Localization from Dashcam Video in the Wild", COMPUTER VISION - ECCV 2016 WORKSHOPS, 2016, pages 136 - 148, XP055565003 *
IWAMI, KAZUYA ET AL.: "Global Metric Localization and Correction of Scale Drift using Street View", IEICE TECHNICAL REPORT, vol. 117, no. 106, 15 June 2017 (2017-06-15), pages 69 - 74 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110106755A (zh) * 2019-04-04 2019-08-09 武汉大学 利用姿态重构铁轨几何形态的高铁轨道不平顺性检测方法
CN110106755B (zh) * 2019-04-04 2020-11-03 武汉大学 利用姿态重构铁轨几何形态的高铁轨道不平顺性检测方法
CN110570473A (zh) * 2019-09-12 2019-12-13 河北工业大学 一种基于点线融合的权重自适应位姿估计方法
WO2021100650A1 (fr) * 2019-11-22 2021-05-27 パナソニックIpマネジメント株式会社 Dispositif d'estimation de position, véhicule, procédé d'estimation de position et programme d'estimation de position
JP2021082181A (ja) * 2019-11-22 2021-05-27 パナソニックIpマネジメント株式会社 位置推定装置、車両、位置推定方法、及び位置推定プログラム
CN111882494A (zh) * 2020-06-28 2020-11-03 广州文远知行科技有限公司 位姿图处理方法、装置、计算机设备和存储介质
CN111882494B (zh) * 2020-06-28 2024-05-14 广州文远知行科技有限公司 位姿图处理方法、装置、计算机设备和存储介质
WO2022118061A1 (fr) * 2020-12-04 2022-06-09 Hinge Health, Inc. Localisations tridimensionnelles d'objets dans des images ou des vidéos
CN112819970A (zh) * 2021-02-19 2021-05-18 联想(北京)有限公司 一种控制方法、装置及电子设备
CN112819970B (zh) * 2021-02-19 2023-12-26 联想(北京)有限公司 一种控制方法、装置及电子设备
JP2023529786A (ja) * 2021-05-07 2023-07-12 テンセント・アメリカ・エルエルシー パノラマ画像におけるグラウンド上のマーカを認識することによってカメラ間のポーズグラフおよび変換マトリクスを推定する方法
CN115700507A (zh) * 2021-07-30 2023-02-07 北京小米移动软件有限公司 地图更新方法及其装置
CN115700507B (zh) * 2021-07-30 2024-02-13 北京小米移动软件有限公司 地图更新方法及其装置
CN113781550A (zh) * 2021-08-10 2021-12-10 国网河北省电力有限公司保定供电分公司 一种四足机器人定位方法与系统
CN113779012B (zh) * 2021-09-16 2023-03-07 中国电子科技集团公司第五十四研究所 一种用于无人机的单目视觉slam尺度恢复方法
CN113779012A (zh) * 2021-09-16 2021-12-10 中国电子科技集团公司第五十四研究所 一种用于无人机的单目视觉slam尺度恢复方法

Similar Documents

Publication Publication Date Title
WO2018235923A1 (fr) Dispositif, procédé et programme d'estimation de position
US10269147B2 (en) Real-time camera position estimation with drift mitigation in incremental structure from motion
Li et al. DeepI2P: Image-to-point cloud registration via deep classification
US11176701B2 (en) Position estimation system and position estimation method
US10269148B2 (en) Real-time image undistortion for incremental 3D reconstruction
US20180315232A1 (en) Real-time incremental 3d reconstruction of sensor data
US6587601B1 (en) Method and apparatus for performing geo-spatial registration using a Euclidean representation
US9396583B2 (en) Method of modelling buildings on the basis of a georeferenced image
US20170236284A1 (en) Registration of aerial imagery to vector road maps with on-road vehicular detection and tracking
CN111862126A (zh) 深度学习与几何算法结合的非合作目标相对位姿估计方法
WO2023045455A1 (fr) Procédé de reconstruction tridimensionnelle de cible non coopérative sur la base d'un enregistrement de reconstruction de branche
US20170092015A1 (en) Generating Scene Reconstructions from Images
CN112750203B (zh) 模型重建方法、装置、设备及存储介质
Tao et al. Automated localisation of Mars rovers using co-registered HiRISE-CTX-HRSC orthorectified images and wide baseline Navcam orthorectified mosaics
CN114565863B (zh) 无人机图像的正射影像实时生成方法、装置、介质及设备
CN112132754B (zh) 一种车辆移动轨迹修正方法及相关装置
Patil et al. A new stereo benchmarking dataset for satellite images
Zhao et al. RTSfM: Real-time structure from motion for mosaicing and DSM mapping of sequential aerial images with low overlap
CN111553845A (zh) 一种基于优化的三维重建的快速图像拼接方法
CN110570474A (zh) 一种深度相机的位姿估计方法及系统
US20240161392A1 (en) Point cloud model processing method and apparatus, and readable storage medium
Suliman et al. Development of line-of-sight digital surface model for co-registering off-nadir VHR satellite imagery with elevation data
Zhao et al. Fast georeferenced aerial image stitching with absolute rotation averaging and planar-restricted pose graph
Fu-Sheng et al. Batch reconstruction from UAV images with prior information
CN113012084A (zh) 无人机影像实时拼接方法、装置及终端设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18819744

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18819744

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP