CN110223380B - Scene modeling method, system and device fusing aerial photography and ground visual angle images - Google Patents

Scene modeling method, system and device fusing aerial photography and ground visual angle images

Info

Publication number
CN110223380B
CN110223380B (application CN201910502762.4A)
Authority
CN
China
Prior art keywords
image
ground
aerial
visual angle
map
Prior art date
Legal status
Active
Application number
CN201910502762.4A
Other languages
Chinese (zh)
Other versions
CN110223380A (en)
Inventor
申抒含
高翔
朱灵杰
胡占义
Current Assignee
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201910502762.4A priority Critical patent/CN110223380B/en
Publication of CN110223380A publication Critical patent/CN110223380A/en
Application granted granted Critical
Publication of CN110223380B publication Critical patent/CN110223380B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T7/00 - Image analysis
    • G06T7/30 - Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33 - Image registration using feature-based methods
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20212 - Image combination
    • G06T2207/20221 - Image fusion; image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of scene modeling, and particularly relates to a scene modeling method, system and device fusing aerial and ground visual angle images, aiming at solving the problem that image-based modeling results of indoor scenes are incomplete and inaccurately fused owing to complex structures and lack of texture. The method comprises the following steps: S100, acquiring aerial visual angle images of the indoor scene to be modeled and constructing an aerial map; S200, acquiring synthetic images by synthesizing ground visual angle reference images from the aerial map; S300, acquiring a ground visual angle image set from the ground visual angle images collected by a ground camera; and S400, fusing the aerial visual angle images and the ground visual angle images based on the synthetic images to obtain an indoor scene model. The method can generate a complete and accurate indoor scene model, balances acquisition efficiency against reconstruction accuracy, and has strong robustness.

Description

Scene modeling method, system and device fusing aerial photography and ground visual angle images
Technical Field
The invention belongs to the field of scene modeling, and particularly relates to a scene modeling method, system and device fusing aerial photography and ground visual angle images.
Background
Three-dimensional reconstruction of indoor scenes plays an important role in many real-world applications, such as indoor navigation, service robots, Building Information Modeling (BIM), and the like. Existing indoor scene reconstruction methods can be roughly divided into three categories: (1) a LiDAR (light detection and ranging) based method, (2) an RGB-D camera based method, and (3) an image based method.
Although LiDAR-based and RGB-D-camera-based methods achieve higher accuracy, occlusions are difficult to avoid when reconstructing larger indoor scenes owing to the limited scanning viewpoints, and both approaches suffer from high cost and poor scalability. For LiDAR-based methods, point clouds from multiple laser scanning viewpoints usually need to be registered during mapping. For RGB-D-camera-based methods, a large amount of data must be acquired and processed because of the limited effective working distance of the sensor. Therefore, these methods are costly and inefficient for large-scale indoor scene reconstruction.
Although image-based methods have lower cost and greater flexibility than LiDAR-based and RGB-D-camera-based methods, they also have drawbacks such as incomplete and inaccurate reconstruction results caused by complex scenes, repetitive structures, and lack of texture. Even with the most advanced structure-from-motion (SfM) and multi-view stereo (MVS) techniques, the reconstruction results in large-scale indoor scenes with complicated structure remain unsatisfactory. In addition, some image-based approaches handle the indoor scene reconstruction problem with prior assumptions, such as the Manhattan-world assumption. Although these methods sometimes yield better results, they often produce erroneous reconstructions when the scene does not conform to the prior assumptions.
Disclosure of Invention
In order to solve the above problems in the prior art, that is, the problem that image-based modeling results of indoor scenes are incomplete and inaccurately fused owing to complex structures and lack of texture, a first aspect of the present invention provides a scene modeling method fusing aerial photography and ground visual angle images, comprising the following steps (a structural sketch of the overall flow is given after the step list):
s100, acquiring an aerial photography view angle image of an indoor scene to be modeled, and constructing an aerial photography map;
step S200, acquiring a synthetic image by a method of synthesizing a ground visual angle reference image from the aerial map based on the aerial map;
step S300, acquiring a ground visual angle image set through a ground visual angle image acquired by a ground camera;
and S400, fusing the aerial visual angle image and the ground visual angle image based on the synthetic image to obtain an indoor scene model.
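For orientation, the four steps above can be arranged as the following Python sketch. Every callable is a hypothetical placeholder to be supplied by an implementation; none of these names come from the patent itself.

```python
from typing import Any, Callable, Sequence

def scene_modeling_pipeline(
    extract_aerial_frames: Callable[[], Sequence[Any]],        # step S100: adaptive frame extraction
    build_aerial_map: Callable[[Sequence[Any]], Any],          # step S100: SfM + MVS + surface reconstruction
    synthesize_references: Callable[[Any], Sequence[Any]],     # step S200: virtual cameras + graph-cut synthesis
    acquire_ground_frames: Callable[[Any, Sequence[Any]], Sequence[Any]],  # step S300: robot acquisition and localization
    fuse_and_reconstruct: Callable[[Any, Sequence[Any], Sequence[Any], Sequence[Any]], Any],  # step S400
) -> Any:
    """Run steps S100-S400 in order and return the indoor scene model."""
    aerial_frames = extract_aerial_frames()
    aerial_map = build_aerial_map(aerial_frames)
    reference_images = synthesize_references(aerial_map)
    ground_frames = acquire_ground_frames(aerial_map, reference_images)
    return fuse_and_reconstruct(aerial_map, aerial_frames, ground_frames, reference_images)
```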
In some preferred embodiments, in step S100, "acquiring an aerial view angle image of an indoor scene to be modeled, and constructing an aerial map", a method thereof is:
extracting image frames of the aerial photography visual angle video of the indoor scene by adopting a self-adaptive video frame extraction method based on a bag-of-words model to obtain an aerial photography visual angle image set of the indoor scene;
and constructing an aerial map by an image modeling method based on the aerial visual angle image set.
In some preferred embodiments, the "method for synthesizing the ground perspective reference image from the aerial map" in step S200 is as follows:
calculating the pose of the virtual camera based on the aerial map;
acquiring a synthetic image of a ground visual angle reference image based on the aerial map through a graph cut algorithm.
in some preferred embodiments, the "acquiring a composite image of the ground perspective reference image based on the aerial map by the graph cut algorithm" includes:
Figure GDA0002715294150000031
wherein E (l) is an energy function in the graph cutting process;
Figure GDA0002715294150000032
set of two-dimensional triangles projected on a three-dimensional grid visible to a virtual camera, tiIs the ith triangle;
Figure GDA0002715294150000033
a public edge set of the triangles in the two-dimensional triangle set obtained by projection; liIs tiThe aerial image sequence number of (1); di(li) Is a data item; vi(li,lj) Is a smoothing term;
when corresponding to tiIn the l-th space patchiData items when visible in an aerial image
Figure GDA0002715294150000034
Otherwise Di(li) α, wherein
Figure GDA0002715294150000035
Is the firstiThe median of the dimensions of the local features in the individual aerial images,
Figure GDA0002715294150000036
to correspond to tiIn the l-th space patchiThe projection area in each aerial image, alpha is a large constant;
when l isi=ljTime, smoothing term Vi(li,lj) 0; otherwise Vi(li,lj)=1。
In some preferred embodiments, in step S300, "acquiring a ground perspective image set by using a ground perspective image collected by a ground camera", the method includes:
the ground robot continuously collects ground visual angle videos through a ground camera arranged on the ground robot based on the planned path;
and extracting image frames of the ground visual angle video of the indoor scene by adopting a self-adaptive video frame extraction method based on the bag-of-words model to obtain a ground visual angle image set of the indoor scene.
In some preferred embodiments, in the process of continuously acquiring ground visual angle videos by a ground camera arranged on a ground robot based on a planned path, a positioning method of the ground robot comprises initial robot positioning and mobile robot positioning;
the initial robot positioning method comprises the following steps: acquiring a first frame of a video acquired by a ground camera, acquiring an initial position of the robot in the aerial photography map, and taking the position as a starting point of the subsequent movement of the robot;
the method for positioning the mobile robot comprises the following steps: and carrying out rough positioning on the position of the robot based on the initial position and the running data of the robot at each moment, acquiring the position of the robot in the aerial photography map at the current moment by matching the video frame image acquired at the current moment with the composite image, and revising the position information of the rough positioning according to the position.
In some preferred embodiments, in step S400, "based on the composite image, the aerial view image and the ground view image are fused to obtain an indoor scene model", and the method includes:
acquiring the position of a ground camera corresponding to each image in the ground visual angle image set in the aerial photography map;
connecting the matching points between the ground visual angle images and the synthetic images into the original aerial and ground feature point tracks to generate cross-view constraints;
optimizing the positions and poses of the aerial and ground images through Bundle Adjustment (BA);
and performing dense reconstruction by using the aerial photography and ground visual angle images to obtain a dense model of the indoor scene.
In a second aspect, the invention provides a scene modeling system fusing aerial photography and ground visual angle images, which comprises an aerial photography map building module, a synthetic image acquisition module, a visual angle image set acquisition module and an indoor scene model acquisition module;
the aerial photography map building module is configured to obtain an aerial photography view angle image of an indoor scene to be modeled and build an aerial photography map;
the synthetic image acquisition module is configured to acquire a synthetic image by a method of synthesizing a ground visual angle reference image from the aerial map based on the aerial map;
the visual angle image set acquisition module is configured to acquire a ground visual angle image set through ground visual angle images acquired by a ground camera;
and the indoor scene model acquisition module is configured to fuse the aerial photography view angle image and the ground view angle image based on the synthetic image to acquire an indoor scene model.
In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the above-mentioned scene modeling method of fusing aerial and ground perspective images.
In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is suitable for being loaded and executed by a processor to implement the scene modeling method for fusing aerial and ground perspective images as described above.
The invention has the beneficial effects that:
the invention guides the robot to move in an indoor scene and collects ground visual angle images by constructing a three-dimensional aerial image, then the aerial image and the ground image are fused, and a complete and accurate indoor scene model is generated by the fused image. The indoor scene reconstruction process has the advantages of both acquisition efficiency and reconstruction accuracy, and has stronger robustness.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a schematic view of a scene modeling method for fusing aerial and ground perspective images according to an embodiment of the present invention;
FIG. 2 is an exemplary illustration of an aerial map reconstructed from 271 decimated video frames in accordance with one embodiment of the present invention;
FIG. 3 is a schematic diagram of a grid-based image composition in accordance with an embodiment of the present invention;
FIG. 4 is a diagram illustrating an example of the relationship between local feature size and image sharpness for one embodiment of the present invention;
FIG. 5 is an exemplary graph of image composition results based on graph cuts in various configurations in accordance with an embodiment of the present invention;
FIG. 6 is an exemplary diagram of additional image synthesis results compared with ground images at similar viewing angles;
FIG. 7 is an exemplary diagram of image matching results in one embodiment of the invention;
FIG. 8 is a schematic diagram illustrating the search of candidate matching composite images during robot motion according to an embodiment of the present invention;
FIG. 9 is a flow chart illustrating a batch camera positioning process according to an embodiment of the present invention;
FIG. 10 is an exemplary illustration of a batch-based camera positioning result based on three feature point trajectories, in accordance with an embodiment of the present invention;
FIG. 11 is a diagram illustrating an exemplary batch camera positioning process in accordance with an embodiment of the present invention;
FIG. 12 is a schematic diagram illustrating aerial and ground feature point trajectory generation for an aerial view in accordance with an embodiment of the present invention;
FIG. 13 is a data acquisition device used in testing of one embodiment of the present invention;
FIG. 14 is an illustration of an example aerial image in a Hall dataset and a generated three-dimensional aerial map in a test of an embodiment of the present invention;
FIG. 15 is an exemplary graph of a comparison experiment result of a frame extraction method of the present invention and an equally spaced frame extraction method on a Hall data set aerial video under test in accordance with an embodiment of the present invention;
FIG. 16 is an exemplary plot of qualitative comparison results of ground camera positioning in a test according to one embodiment of the present invention;
FIG. 17 is an exemplary graph of qualitative results of indoor scene reconstruction under test according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Due to the complexity of indoor scenes, the following two issues need to be considered for the image-based approach to achieve complete reconstruction of the scene. The first is the image acquisition process, i.e., how to acquire images to completely and efficiently cover an indoor scene. The second is a scene reconstruction algorithm, i.e. how to fuse images of different viewing angles in the SfM and MVS processes to obtain a complete and accurate reconstruction result. Aiming at the two problems, the invention provides a novel indoor scene acquisition and reconstruction process based on images. The process uses a mini-aircraft and a ground robot and comprises four main steps (as shown in fig. 1): (1) constructing an aerial photography map: acquiring an aerial photography visual angle image indoors by adopting a mini aircraft, acquiring a triangular mesh representing an indoor scene from the aerial photography visual angle image, and using the triangular mesh for positioning and navigating a map for a ground robot; (2) and (3) reference image synthesis: and carrying out plane detection in the aerial photography map, acquiring a ground plane and planning a path of the ground robot. Then, synthesizing a plurality of ground visual angle images based on the aerial photography map for positioning the ground robot; (3) positioning a ground robot: and the ground robot enters an indoor scene to acquire ground visual angle images. When the robot moves and collects images, the robot is positioned by matching the collected images with the synthesized ground visual angle image; (4) indoor scene reconstruction: after the ground robot finishes image acquisition, the mini aircraft image and the ground robot image are fused in the image-based modeling process, so that complete and accurate modeling of an indoor scene is realized.
In the modeling process, manual operation is only needed for aerial image acquisition; the subsequent ground image acquisition and indoor scene modeling are fully automatic, which means that the process of the invention has strong scalability and is suitable for the acquisition and reconstruction of large-scale indoor scenes. The aerial images could also be collected automatically by autonomous navigation along a computed path, but this would increase the complexity of the algorithm; manual operation is therefore preferred to ensure the flexibility and completeness of the acquired images as well as scalability.
The aerial images captured by the mini-aircraft have a better viewing angle and a larger field of view than the ground images captured by the ground robot, which means that occlusion and mismatch problems in the aerial images can be smaller relative to the ground images. Thus, the map generated from the aerial image can be used more reliably in the subsequent ground robot positioning process.
Aerial images taken by the mini-aircraft and ground images taken by the ground robot complement each other and can completely cover an indoor scene. Therefore, a more complete and accurate indoor scene model can be obtained by fusing the aerial photography and the ground image.
The invention discloses a scene modeling method fusing aerial photography and ground visual angle images, which comprises the following steps:
s100, acquiring an aerial photography view angle image of an indoor scene to be modeled, and constructing an aerial photography map;
step S200, acquiring a synthetic image by a method of synthesizing a ground visual angle reference image from the aerial map based on the aerial map;
step S300, acquiring a ground visual angle image set through a ground visual angle image acquired by a ground camera;
and S400, fusing the aerial visual angle image and the ground visual angle image based on the synthetic image to obtain an indoor scene model.
In order to more clearly explain the scene modeling method for fusing aerial photography and ground perspective images, the following is a detailed description of the steps in an embodiment of the method according to the present invention with reference to the accompanying drawings.
The scene modeling method fusing the aerial photography and the ground perspective image comprises the steps S100-S400.
And S100, acquiring an aerial photography view angle image of an indoor scene to be modeled, and constructing an aerial photography map.
Firstly, acquiring an aerial video of an indoor scene by adopting a mini aircraft, and extracting some images from the video. And then reconstructing the extracted image through a process based on image modeling to obtain an aerial photography model, and using the aerial photography model as a three-dimensional map for positioning the ground robot.
And S101, extracting image frames of the aerial photography visual angle video of the indoor scene by adopting a self-adaptive video frame extraction method based on a bag-of-words model to obtain an aerial photography visual angle image set of the indoor scene.
In this embodiment, a mini aircraft is used to collect top-down aerial visual angle video in the indoor scene; the collected video has a resolution of 1080p and a frame rate of 25 FPS. Because the mini unmanned aerial vehicle is small and highly maneuverable, it is well suited to shooting indoor scenes. For example, the mini aircraft used in this embodiment is a DJI Spark equipped with a stabilizer and a 4K camera, weighing only 300 g. In addition, compared with the ground visual angle, shooting the indoor scene from the aerial visual angle is less affected by scene occlusion, so the scene can be covered more efficiently and completely with the mini aircraft.
Given the acquired aerial video, an aerial map could be constructed by a simultaneous localization and mapping (SLAM) system. However, in the present embodiment, an offline SfM technique is adopted for aerial map construction. This is because: (1) in this embodiment the aerial map is used for positioning the ground robot, so online construction is not needed; (2) compared with SLAM, which is prone to scene drift, SfM is more suitable for large-scale scene modeling. However, if SfM is used for aerial map construction, it is clearly unnecessary to use all frames of the aerial video; doing so would seriously reduce the efficiency of SfM mapping because of the large amount of redundant information contained in the aerial video frames. To address this, a straightforward solution is to extract one frame at fixed frame intervals in the video and then map the extracted video frames. However, this approach still has some disadvantages: (1) it is difficult to achieve stable, constant-speed video acquisition in indoor scenes by manually operating a mini aircraft, and this problem becomes harder at the corners of the flight path; (2) since the texture richness in indoor scenes is not uniform, evenly covering the scene is not appropriate. In order to solve the above problems in the process of constructing the aerial map, this embodiment adopts an adaptive video frame extraction method based on a bag-of-words (BoW) model, detailed as follows:
in the BoW model, an image can be represented as a normalized vector viAnd the similarity of a pair of images can be multiplied by the point of the corresponding vector
Figure GDA0002715294150000101
And (4) showing. As known to those skilled in the art, too high similarity between adjacent images may introduce too much redundant information, thereby reducing the patterning efficiency; and too low similarity between adjacent images can result in poor connectivity between images and incomplete composition. Therefore, in the present embodiment, a method for adaptively extracting a subset from the whole video frames is proposed, which limits the similarity between each extracted video frame and its neighboring extracted video frame within a suitable range during frame extraction. Specifically, a normalized vector v of each frame is first generated by a libvot libraryiAnd takes the first frame as a starting point. In the frame extraction process, assuming that the current ith frame is extracted, the score of the similarity between the frame and the subsequent frame is obtained: { si,jI +1, i +2, … }, wherein
Figure GDA0002715294150000102
Then, will
Figure GDA0002715294150000103
Comparing with a preset similarity threshold t, wherein t is 0.1 in the embodiment; suppose that
Figure GDA0002715294150000104
Is { s }i,jThe first one of which satisfies the following inequality: si,jIf t is less than j *1 frame (i.e., the first previous frame satisfying the above inequality) is the next decimated video frame. The above process is iterated until all video frames are verified.
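As an illustration of this extraction rule, the following Python sketch assumes the normalized BoW vectors have already been computed (for example with a vocabulary-tree library such as libvot) and are stacked row-wise in a NumPy array; the function and variable names are illustrative, not taken from the patent.

```python
import numpy as np

def adaptive_frame_extraction(bow_vectors: np.ndarray, t: float = 0.1) -> list:
    """Select a subset of frames so that each kept frame and the next kept frame
    have a BoW similarity just above the threshold t (0.1 in this embodiment).

    bow_vectors: (num_frames, vocab_size) array of normalized BoW vectors.
    Returns the indices of the extracted frames.
    """
    selected = [0]                                 # the first frame is the starting point
    i, n = 0, len(bow_vectors)
    while i < n - 1:
        # similarity scores s_{i,j} = v_i . v_j with all subsequent frames
        scores = bow_vectors[i + 1:] @ bow_vectors[i]
        below = np.nonzero(scores < t)[0]          # positions where similarity drops below t
        if len(below) == 0:
            break                                  # remaining frames are all similar enough
        j_star = i + 1 + int(below[0])             # first frame with s_{i,j*} < t
        nxt = max(j_star - 1, i + 1)               # keep the frame just before the drop
        selected.append(nxt)
        i = nxt
    return selected
```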
And S102, constructing an aerial image map by an image modeling method based on the aerial visual angle image set.
Based on the aerial photography visual angle image set obtained in step S101, an aerial map is constructed through a standard image-based modeling pipeline comprising: (1) SfM, (2) MVS, and (3) surface reconstruction. In addition, since no GPS signal is received indoors, the aerial map is scaled to its true physical size using ground control points (GCPs). Fig. 2 shows an example of an aerial map reconstructed from 271 extracted video frames: the first three columns are example aerial images and the three-dimensional aerial map regions corresponding to them, the fourth column is the whole three-dimensional aerial map, and the fifth column shows the robot path planning and virtual camera pose calculation results on the aerial map, where the ground plane is marked in light gray, the planned path is marked with line segments, and the virtual camera poses are represented by pyramids.
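The SfM, MVS and surface reconstruction stages of this standard pipeline could, for instance, be driven by the open-source COLMAP command-line tools that are also used for comparison later in this description; the commands, folder names and matcher choice below are an assumed setup, not one prescribed by the patent.

```python
import subprocess

def run(cmd):
    """Run one COLMAP stage and stop on failure."""
    subprocess.run(cmd, check=True)

# Sparse reconstruction (SfM) on the extracted aerial frames.
run(["colmap", "feature_extractor", "--database_path", "aerial.db", "--image_path", "aerial_frames"])
run(["colmap", "sequential_matcher", "--database_path", "aerial.db"])   # the frames come from a video
run(["colmap", "mapper", "--database_path", "aerial.db",
     "--image_path", "aerial_frames", "--output_path", "sparse"])

# Dense reconstruction (MVS) and surface meshing.
run(["colmap", "image_undistorter", "--image_path", "aerial_frames",
     "--input_path", "sparse/0", "--output_path", "dense"])
run(["colmap", "patch_match_stereo", "--workspace_path", "dense"])
run(["colmap", "stereo_fusion", "--workspace_path", "dense", "--output_path", "dense/fused.ply"])
run(["colmap", "poisson_mesher", "--input_path", "dense/fused.ply",
     "--output_path", "dense/aerial_map.ply"])
# The resulting mesh is then scaled to metric size with ground control points (not shown here).
```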
And S200, acquiring a synthetic image by a method of synthesizing a ground visual angle reference image from the aerial map based on the aerial map.
The aerial image map constructed in step S100 of this embodiment plays two roles in the subsequent process: the first is planning a path for a ground robot and positioning the ground robot in the moving process of the ground robot; the second is to help the fusion of aerial and ground images during the indoor scene reconstruction process. Both of the above two processes require the establishment of a two-dimensional to three-dimensional point correspondence between the ground image and the aerial image. To obtain the above-mentioned corresponding points, one potentially effective solution is to directly match the aerial and ground images. However, since the two images are greatly different in view angle, it is very difficult to directly match them. Here, the present embodiment solves the above-described problems by synthesizing the ground perspective reference image from the aerial map. The reference image is synthesized through the following two steps: virtual camera pose calculation and graph cut based image synthesis.
Step S201, calculating the position and the pose of the virtual camera based on the aerial photography map.
The virtual camera pose for reference image synthesis is calculated based on the ground plane of the indoor scene, and the ground plane of the aerial photography map in this embodiment is detected by a plane detection method based on random sample consensus (see fig. 2). The virtual camera pose is calculated in two steps, the position is calculated first, and then the orientation is calculated.
In step S2011, a virtual camera position is calculated.
A two-dimensional bounding box of the ground plane is solved and divided into square grids, the size of which determines the number of virtual cameras. In order to balance the positioning accuracy and the efficiency, the grid side length is set to 1m in the present embodiment. For each grid, when the proportion of the ground plane area in the grid to the total area of the grid is more than 50%, the grid is considered as an effective grid for placing a virtual camera. The virtual camera position is set to the center of the active grid with an elevation offset of height h (see fig. 2). The value of h is determined by the height of the ground camera, which is set to 1m in this embodiment.
Step S2012, virtual camera orientation design.
After obtaining the virtual camera positions, in order to realize omnidirectional observation of a scene, a plurality of virtual cameras with the same optical center and different directions need to be placed at each virtual camera position. In this embodiment, since the optical axis of the camera mounted on the ground robot is approximately parallel to the ground plane, only the horizontally oriented virtual camera is generated here. In order to eliminate perspective projection distortion between the ground and the composite image, the field of view (intrinsic parameters) of the virtual camera needs to be set close to the ground camera. In this embodiment, 6 virtual cameras are placed at each virtual camera position, with a yaw angle between the virtual cameras of 60 °.
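The grid-based placement and the six horizontal orientations described above can be sketched as follows. The ground plane is assumed to be horizontal with z pointing up and to be given as a per-cell coverage mask, which is a simplification of the mesh-based plane detection; all names are illustrative.

```python
import math
import numpy as np

def virtual_camera_poses(ground_coverage: np.ndarray, origin_xy, cell_m: float = 1.0,
                         height_m: float = 1.0, num_orientations: int = 6):
    """Place virtual cameras on valid cells of the ground-plane grid.

    ground_coverage: (rows, cols) array with the fraction of each cell covered
                     by the detected ground plane (cell side length cell_m).
    origin_xy:       world (x, y) coordinates of the corner of cell (0, 0).
    Returns a list of (center, yaw) pairs: optical center in world coordinates
    and horizontal viewing direction in radians.
    """
    poses = []
    rows, cols = ground_coverage.shape
    yaw_step = 2.0 * math.pi / num_orientations            # 60 degrees for 6 cameras
    for r in range(rows):
        for c in range(cols):
            if ground_coverage[r, c] <= 0.5:               # valid cell: more than 50% ground plane
                continue
            center = np.array([origin_xy[0] + (c + 0.5) * cell_m,
                               origin_xy[1] + (r + 0.5) * cell_m,
                               height_m])                  # offset h = 1 m above the ground plane
            for k in range(num_orientations):
                poses.append((center, k * yaw_step))       # same optical center, different yaw
    return poses
```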
In addition, the path for the ground robot motion is also planned on the detected ground plane. Since this embodiment does not focus on planning an optimal path for the ground robot, the skeleton of the detected ground plane is used as the robot path, and the skeleton is extracted by the medial axis transform (see fig. 2).
And S202, acquiring a synthetic image of the ground visual angle reference image based on the aerial map through a graph cut algorithm.
This embodiment performs image synthesis by means of the spatially continuous mesh, as shown in FIG. 3, where f is a three-dimensional spatial patch whose two-dimensional projection triangles on the aerial camera C_a and the virtual ground camera C_v are denoted as t_a and t_v, respectively; the principle of image synthesis is to warp t_a to t_v via f. Specifically, the visible mesh of each aerial and virtual camera is first determined. Then, for each virtual camera, its visible mesh is projected onto that camera to form a set of two-dimensional triangles. When synthesizing a virtual image, for a particular two-dimensional triangle in the virtual image, the aerial image used to fill this area by warping is determined according to three factors: (1) visibility, i.e., for the three-dimensional spatial patch corresponding to the two-dimensional triangle, the selected aerial image should have a good viewing angle and a close viewing distance; (2) sharpness, i.e., since part of the images extracted from the indoor aerial video are blurred, sufficiently clear aerial images should be selected; (3) consistency, i.e., adjacent triangles in the virtual image should be synthesized from the same aerial image as far as possible to keep the synthesized image consistent. In this embodiment, the visibility factor is measured by the projection area of the spatial patch on an aerial image (the larger, the better), and the sharpness factor is measured by the median of the local feature scales of the aerial image (the smaller, the better); see fig. 4, where the two columns on the left are the two images with the largest median of the local feature scales, the two columns on the right are the two images with the smallest median, and the second row shows enlarged views of the rectangular regions in the first row. Based on the above description, the image synthesis problem in this embodiment can be formulated as a multi-label optimization problem, defined in formula (1):

E(L) = Σ_{t_i ∈ T} D_i(l_i) + Σ_{(t_i, t_j) ∈ N} V_{i,j}(l_i, l_j)    (1)

wherein E(L) is the energy function of the graph-cut process; T is the set of two-dimensional triangles obtained by projecting the three-dimensional mesh visible to the virtual camera, and t_i is the i-th triangle; N is the set of common edges between triangles in the projected two-dimensional triangle set; l_i is the label of t_i, i.e., the serial number of an aerial image. When the spatial patch corresponding to t_i is visible in the l_i-th aerial image, the data term is D_i(l_i) = m_{l_i} / A_{i,l_i}, where m_{l_i} is the median of the local feature scales in the l_i-th aerial image and A_{i,l_i} is the projection area of the spatial patch corresponding to t_i in the l_i-th aerial image; otherwise D_i(l_i) = α, where α is a large constant (α = 10^4 in this embodiment) used to penalize this case. When l_i = l_j, the smoothing term V_{i,j}(l_i, l_j) = 0; otherwise V_{i,j}(l_i, l_j) = 1. The optimization problem defined in equation (1) can be solved efficiently by graph cut algorithms.
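To make the structure of this labeling problem concrete, the sketch below builds the data term and the Potts smoothing term of formula (1) and minimizes the energy with a simple iterated-conditional-modes loop as a stand-in for the graph-cut solver used in this embodiment; the visibility flags, projection areas and feature-scale medians are assumed to be precomputed.

```python
import numpy as np

ALPHA = 1.0e4   # penalty for assigning an aerial image in which the patch is not visible

def synthesize_labels(visible: np.ndarray, proj_area: np.ndarray,
                      scale_median: np.ndarray, edges: list,
                      smooth_weight: float = 1.0, iters: int = 10) -> np.ndarray:
    """Assign one aerial-image label l_i to each projected 2D triangle t_i.

    visible:      (T, A) bool, patch of triangle i visible in aerial image a
    proj_area:    (T, A) projection area of the patch in aerial image a
    scale_median: (A,)   median local-feature scale of each aerial image
    edges:        list of common-edge pairs (i, j) between adjacent triangles
    """
    T, A = visible.shape
    # Data term D_i(l): scale median / projection area when visible, otherwise ALPHA.
    data = np.full((T, A), ALPHA)
    ok = visible & (proj_area > 0)
    data[ok] = np.broadcast_to(scale_median, (T, A))[ok] / proj_area[ok]

    labels = data.argmin(axis=1)                      # start from the best data term
    neighbors = [[] for _ in range(T)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # ICM stand-in for graph cuts: greedily re-label triangles while the energy decreases.
    for _ in range(iters):
        changed = False
        for i in range(T):
            # Potts smoothness: pay smooth_weight for each neighbor with a different label.
            smooth = np.array([sum(l != labels[j] for j in neighbors[i]) for l in range(A)],
                              dtype=float) * smooth_weight
            best = int(np.argmin(data[i] + smooth))
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels
```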
To clarify the influence of the sharpness factor and the consistency factor, in this embodiment image synthesis is performed for one of the virtual cameras under four different configurations, and the results are shown in fig. 5; from left to right they are: neither the sharpness factor nor the consistency factor considered; only the consistency factor considered; only the sharpness factor considered; both the sharpness factor and the consistency factor considered. The large rectangle at the upper right corner of each image is an enlarged view of the small rectangle in that image. As can be seen from fig. 5, the sharpness factor makes the synthesized image clearer, and the consistency factor gives the synthesized image fewer holes and sharper edges. In addition, fig. 6 shows some other image synthesis results together with ground images from similar viewing angles. Although some synthesis errors are still difficult to avoid, the synthesized images have a large similarity to the corresponding ground images in the commonly visible regions, which verifies the effectiveness of the image synthesis method in this embodiment. The synthesized images in this step are used as the reference database for ground robot positioning.
And step S300, acquiring a ground visual angle image set through the ground visual angle images acquired by the ground camera.
When the ground robot is placed in an indoor scene, the robot moves along a planned path and automatically collects ground visual angle videos. If the robot is positioned only by its built-in sensors, e.g. wheel encoders and Inertial Measurement Units (IMU), it will not move exactly along the planned path. This is because the built-in sensors of robots suffer from cumulative errors, which is particularly significant for low cost sensors mounted on consumer-grade robots. Therefore, the pose of the robot needs to be corrected by means of visual positioning, and the visual positioning is realized by matching synthesis and ground images in the step.
And S301, continuously acquiring ground visual angle videos by the ground robot through a ground camera arranged on the ground robot based on the planned path.
In this step, the positioning method includes initial robot positioning and mobile robot positioning.
(1) Initial robot positioning
The method for initial robot positioning comprises the following steps: and acquiring a first frame of a video acquired by a ground camera, acquiring an initial position of the robot in the aerial photography map, and taking the position as a starting point of the subsequent movement of the robot.
By positioning the first frame of the video collected by the ground camera, the initial position of the robot in the aerial map can be obtained, and this position is used as the starting point of the subsequent robot motion. The initial positioning can be achieved by matching the first frame image either with all the synthesized images or with the k most similar synthesized images obtained by semantic tree retrieval; the image-retrieval-based approach is used in this step, with k = 30. It should be noted that, although the ground visual angle images are synthesized, the ground images and the synthesized images still differ considerably in illumination, perspective, and so on, and the commonly used scale-invariant feature transform (SIFT) features are not sufficient to cope with these differences. ASIFT (Affine-SIFT) features are therefore adopted in this step.
In order to verify the validity of the image synthesis method in this step and to compare SIFT with ASIFT features, this embodiment performs synthesized-to-ground image matching and aerial-to-ground image matching with SIFT and with ASIFT features respectively. The ground images are extracted from the video acquired by the ground robot using the adaptive video frame extraction method based on the bag-of-words model in step S100. During image matching, different numbers of synthesized images and aerial images most similar to the current ground image are retrieved, and a pair of images is considered matched when the number of matching points remaining after fundamental matrix verification is larger than 16. The image matching results are shown in fig. 7 (the x-axis is the number of retrieved images and the y-axis is the number of matched image pairs). As can be seen from fig. 7, the number of matched pairs obtained by matching synthesized and ground images with ASIFT is about 6, 8, and 19 times that obtained by matching aerial and ground images with ASIFT, matching synthesized and ground images with SIFT, and matching aerial and ground images with SIFT, respectively.
Given the two-dimensional matching points between the first ground frame and the retrieved synthesized images, the corresponding three-dimensional spatial points on the aerial map are obtained by ray casting. The first ground frame can therefore be positioned by a perspective-n-point (PnP) based method. Specifically, given the two-dimensional-to-three-dimensional corresponding points and the ground camera intrinsic parameters, the camera pose is solved within RANSAC using several PnP algorithms, namely P3P, AP3P and EPnP. When the number of inliers for at least one of these algorithms exceeds 16, the pose estimation is regarded as successful, and the camera pose is taken from the PnP result with the largest number of inliers. In the RANSAC process of this embodiment, 500 random samples are drawn in total and the distance threshold is set to 4 px.
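A minimal version of this multi-algorithm PnP-RANSAC step using OpenCV is sketched below; it assumes the two-dimensional-to-three-dimensional correspondences and the camera intrinsic matrix are already available, and it follows the values stated in this embodiment (500 samples, a 4 px threshold, at least 16 inliers).

```python
import cv2
import numpy as np

def localize_by_pnp(pts3d: np.ndarray, pts2d: np.ndarray, K: np.ndarray, min_inliers: int = 16):
    """Try P3P, AP3P and EPnP inside RANSAC and keep the result with the most inliers.

    pts3d: (N, 3) spatial points on the aerial map; pts2d: (N, 2) image points; K: 3x3 intrinsics.
    Returns (rvec, tvec, inlier_indices) or None if positioning fails.
    """
    best = None
    for flag in (cv2.SOLVEPNP_P3P, cv2.SOLVEPNP_AP3P, cv2.SOLVEPNP_EPNP):
        ok, rvec, tvec, inliers = cv2.solvePnPRansac(
            pts3d.astype(np.float64), pts2d.astype(np.float64), K, None,
            iterationsCount=500, reprojectionError=4.0, flags=flag)
        if ok and inliers is not None and len(inliers) >= min_inliers:
            if best is None or len(inliers) > len(best[2]):
                best = (rvec, tvec, inliers.ravel())
    return best
```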
(2) Mobile robot positioning
The method for positioning the mobile robot comprises the following steps: and carrying out rough positioning on the position of the robot based on the initial position and the running data of the robot at each moment, acquiring the position of the robot in the aerial photography map at the current moment by matching the video frame image acquired at the current moment with the composite image, and revising the position information of the rough positioning according to the position.
When the ground robot moves in an indoor scene and collects videos, the ground robot can be roughly positioned through the wheel odometer. In this step, the ground robot is globally positioned on the aerial map by matching the ground with the composite image to correct the rough positioning result of the robot. And only the extracted ground video frames are subjected to pose correction, but not all the video frames. This is because: (1) the ground robot moves relatively slowly indoors and does not deviate from a planned path seriously in a short time; (2) each global visual localization takes about 0.5s and time is mainly spent on ASIFT feature extraction. It should be noted that for some decimated video frames, visual positioning may not always succeed due to the insufficient number of inliers for PnP.
Let c_A and n_A denote the position and orientation of the last successfully positioned ground image, and let c_B and n_B denote the rough position and orientation of the ground image currently to be positioned, obtained from the wheel odometer. Here, the candidate matching synthesized images for the current ground image are searched for based on the rough positioning result rather than by image retrieval. The process is illustrated in FIG. 8, where c_A and n_A are the position and orientation of the last successfully positioned ground image, c_B and n_B are the rough position and orientation of the current ground image, the circle represents the search range with center c_B and radius r_B, the triangles represent virtual camera poses, the light gray triangles represent the selected synthesized images, and the dark gray triangles represent the unselected synthesized images. A synthesized image is matched with the current ground image when it satisfies the following two conditions: (1) the synthesized image lies within the circle of center c_B and radius r_B, where r_B = max(‖c_B - c_A‖, β) and β = 2 m; (2) the angle between the orientation of the synthesized image and n_B is less than 90°. A variable radius r_B is used because the drift of the relative pose obtained from the robot's built-in sensors becomes more and more severe as the robot moves. After the current ground image is matched with the obtained candidate synthesized images, it is positioned by the PnP-based random sample consensus (RANSAC) method in the same way as in the initial robot positioning. If the positioning result deviates from the rough positioning result by a sufficiently small amount in position and orientation (in this embodiment, position deviation less than 5 m and orientation deviation less than 30°), the current ground image is successfully positioned. The pose of the robot is then globally corrected by the currently positioned ground image, and the pose in the wheel odometer is reset to the current vision-based positioning result. Ground images that are not successfully positioned in this step are repositioned in the subsequent indoor scene reconstruction process.
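The candidate search rule (a circle of variable radius r_B around the odometry position c_B plus the 90° orientation test) can be sketched as follows; the virtual camera poses are assumed to be given as centers and unit viewing directions, with β = 2 m as in this embodiment.

```python
import numpy as np

def candidate_composite_images(c_A, c_B, n_B, virt_centers, virt_dirs, beta: float = 2.0):
    """Return the indices of synthesized images to match against the current ground image.

    c_A:        position of the last successfully positioned ground image.
    c_B, n_B:   rough position and unit viewing direction from the wheel odometer.
    virt_centers, virt_dirs: (M, 3) virtual camera centers and unit viewing directions.
    """
    r_B = max(np.linalg.norm(c_B - c_A), beta)             # search radius grows with odometry drift
    inside = np.linalg.norm(virt_centers - c_B, axis=1) <= r_B
    facing = virt_dirs @ n_B > 0.0                         # angle to n_B smaller than 90 degrees
    return np.nonzero(inside & facing)[0]
```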
Step S302, extracting image frames of the ground visual angle video of the indoor scene by adopting a bag-of-words model-based self-adaptive video frame extracting method to obtain a ground visual angle image set of the indoor scene.
In this step, image frames are extracted from the acquired ground visual angle video of the indoor scene using the adaptive video frame extraction method based on the bag-of-words model from step S100, so as to obtain the ground visual angle image set of the indoor scene; since the method is the same, it is not repeated here.
And S400, fusing the aerial visual angle image and the ground visual angle image based on the synthetic image to obtain an indoor scene model.
After robot positioning and video capture, not all frames extracted from the ground video have been successfully positioned on the aerial map. However, to obtain a complete indoor scene reconstruction, all images extracted from the aerial and ground videos need to be positioned and fused. First, a batch-type positioning process is proposed to position the ground images that were not previously positioned successfully. Then, the matched inliers between the ground and synthesized images are connected into the original feature point tracks, and the fusion of the aerial and ground point clouds is realized through Bundle Adjustment (BA). Finally, a complete and dense indoor scene reconstruction result is obtained by fusing the aerial and ground images.
Step S401, the position of the ground camera corresponding to each image in the ground view image set in the aerial photography map is obtained.
In order to position the ground images that were not successfully positioned in step S301, the invention provides a batch-type camera positioning process. In each camera positioning cycle, as many cameras as possible are positioned. Here, the three-dimensional spatial points among the two-dimensional-to-three-dimensional corresponding points used for camera positioning include not only spatial points reconstructed in the SfM process but also spatial points obtained by intersecting the aerial map (three-dimensional mesh) via ray casting. Each batch-type camera positioning cycle comprises three steps: (1) camera positioning, (2) scene extension and bundle adjustment (BA), and (3) camera filtering; the flowchart is shown in fig. 9. Before batch camera positioning, the images extracted from the ground video are matched and the matching points are connected into feature point tracks. For feature point tracks visible in at least two successfully positioned images, their spatial coordinates are obtained by triangulation.
Step S4011, the camera is positioned.
There are two ways to obtain two-dimensional-to-three-dimensional corresponding points for positioning a ground image that has not yet been successfully positioned: (1) via the aerial map: for the two-dimensional feature points in the currently unpositioned ground image, their matching points in the successfully positioned images are obtained; rays are then cast from the optical centers of the successfully positioned cameras through these matching points, and the intersections of the rays with the aerial map are the three-dimensional spatial points corresponding to the two-dimensional feature points in the currently unpositioned ground image. (2) via the ground feature point tracks obtained by triangulation: the corresponding two-dimensional feature points in the currently unpositioned ground image are obtained from the matching results between ground images. The currently unpositioned ground camera can then be positioned with these two-dimensional-to-three-dimensional corresponding points by the PnP-based RANSAC method, and the positioning result with more inliers between the two is adopted. This combined approach is compared with using either source alone, and the results are shown in fig. 10, which presents the batch camera positioning results based on (1) the aerial map and the ground feature point tracks, (2) the aerial map only, and (3) the ground feature point tracks only; the x-axis is the number of batch camera positioning cycles, and the y-axis is the number of successfully positioned cameras, with the y value at x = 0 being the number of cameras successfully positioned in step S300. As can be seen from fig. 10, the three methods eventually position the same number of cameras after several iteration cycles. However, the number of iteration cycles required by the combined method of this embodiment is the smallest (only 5, versus 6 and 8 for the other two methods).
And step S4012, expanding the scene and BA.
After the cameras are positioned, the ground feature point trajectories are triangulated from the newly positioned cameras to achieve scene expansion. In order to improve the accuracy of the camera pose and the scene point, the spatial position of the ground feature point track obtained by triangulation and the positioned ground camera pose are optimized through BA after triangulation.
Step S4013, camera filtering.
To improve robustness, a camera filtering operation is applied after BA to the cameras positioned in the current cycle. If, after BA optimization, the position or orientation of a camera positioned in the current iteration cycle deviates greatly from the rough positioning result obtained by the wheel odometer (position deviation greater than 5 m or orientation deviation greater than 30°), the positioning result is judged unreliable and filtered out. It should be noted that a camera filtered out in the current iteration cycle can still be successfully positioned in a subsequent iteration cycle.
The three steps are iterated until all the cameras are successfully positioned or no more cameras can be successfully positioned. The batch camera positioning process is shown in fig. 11, wherein the pyramid represents the camera pose successfully positioned. The 0 th iteration represents the camera positioning result in step S300.
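The outer loop of this batch positioning can be sketched as follows; the PnP-based localizer is passed in as a callable (for example, a wrapper around the PnP-RANSAC sketch given in step S300), poses are simplified to a rotation vector plus a camera center, and the scene extension and bundle adjustment of step (2) are only indicated by a comment.

```python
import numpy as np
from typing import Callable, Dict, Optional, Tuple

Pose = Tuple[np.ndarray, np.ndarray]   # (rotation vector, camera center in world coordinates)

def batch_camera_positioning(
    unpositioned: Dict[int, Tuple[np.ndarray, np.ndarray]],  # camera id -> (pts3d, pts2d) correspondences
    odometry: Dict[int, Pose],                               # rough pose per camera from the wheel odometer
    localize: Callable[[np.ndarray, np.ndarray], Optional[Pose]],  # PnP-based localizer
    pos_tol_m: float = 5.0, ori_tol_deg: float = 30.0,
) -> Dict[int, Pose]:
    """Iterate until no additional camera can be positioned."""
    positioned: Dict[int, Pose] = {}
    while True:
        newly = {}
        for cam, (pts3d, pts2d) in unpositioned.items():     # (1) camera positioning
            pose = localize(pts3d, pts2d)
            if pose is not None:
                newly[cam] = pose
        # (2) scene extension and bundle adjustment would refine `newly` and the scene points here.
        accepted = {}
        for cam, (rvec, center) in newly.items():            # (3) camera filtering against the odometry
            rvec0, center0 = odometry[cam]
            pos_dev = float(np.linalg.norm(center - center0))
            ori_dev = float(np.degrees(np.linalg.norm(rvec - rvec0)))   # coarse angular deviation
            if pos_dev <= pos_tol_m and ori_dev <= ori_tol_deg:
                accepted[cam] = (rvec, center)
        if not accepted:
            return positioned
        positioned.update(accepted)
        for cam in accepted:
            unpositioned.pop(cam, None)
```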
And S402, connecting the ground visual angle image and the synthetic image matching point into the original aerial photography and ground characteristic point track, and generating cross-view constraint.
In order to fuse the aerial and ground point clouds through BA, constraints between the aerial and ground images need to be introduced. Here, these cross-view constraints are provided by the aerial and ground feature point tracks generated from the matching points acquired by matching the ground images with the synthesized images in step S300. The matched ground image feature points can be conveniently connected into the original ground feature point tracks by querying their indices. However, although the synthesized images are generated from the aerial images, connecting the matched synthesized-image feature points into the original aerial feature point tracks is not as easy, because the feature points used for matching with the ground images are re-extracted on the synthesized images. In this step, the matching points between the ground and synthesized images are extended to the aerial views by means of ray casting and point projection; the process is illustrated in FIG. 12, where C_i (i = 1, 2, 3) are aerial cameras, X_j (j = 1, 2, 3) are the spatial points corresponding to the matched synthesized-image feature points, t_ij is the projection of point X_j on camera C_i, and t_1j-t_2j-t_3j (j = 1, 2) is the j-th cross-view aerial feature point track. Specifically, the spatial points corresponding to the matched synthesized-image feature points are obtained on the aerial map by ray casting, and the obtained spatial points are then projected onto the aerial images in which they are visible to produce the aerial and ground feature point tracks.
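One possible realization of the ray casting and point projection step is sketched below with the trimesh library and a plain pinhole camera model; the interface and all names are illustrative and not taken from the patent.

```python
import numpy as np
import trimesh

def extend_matches_to_aerial_view(mesh: trimesh.Trimesh, ray_origins: np.ndarray,
                                  ray_dirs: np.ndarray, K: np.ndarray,
                                  R: np.ndarray, t: np.ndarray):
    """Cast rays through matched synthesized-image feature points onto the aerial map (mesh),
    then project the hit points into one visible aerial camera (pinhole model).

    ray_origins, ray_dirs: (N, 3) rays from the virtual camera through the matched features.
    K, R, t: intrinsics and world-to-camera extrinsics of the aerial camera.
    Returns (hit_points, pixels, ray_indices) for the rays that actually hit the mesh.
    """
    hits, ray_idx, _ = mesh.ray.intersects_location(ray_origins, ray_dirs, multiple_hits=False)
    if len(hits) == 0:
        return np.empty((0, 3)), np.empty((0, 2)), ray_idx
    cam_pts = (R @ hits.T + t.reshape(3, 1)).T              # world -> camera coordinates
    proj = cam_pts @ K.T
    pixels = proj[:, :2] / proj[:, 2:3]                     # perspective division
    return hits, pixels, ray_idx
```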
And S403, optimizing the aerial image and the ground view image point cloud through BA.
In this step, the Ceres library is adopted to globally optimize the newly connected aerial and ground feature point tracks, the original (aerial and ground) feature point tracks, and the intrinsic and extrinsic parameters of all (aerial and ground) cameras by minimizing the reprojection error.
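The patent performs this global optimization with the Ceres library; as a small conceptual stand-in, the Python sketch below expresses the same reprojection-error objective with scipy.optimize.least_squares, using angle-axis rotations and a shared pinhole intrinsic matrix. It illustrates the objective only and is not the patent's implementation.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def reprojection_residuals(params, n_cams, n_pts, K, cam_idx, pt_idx, observed_uv):
    """Residuals for bundle adjustment.

    params = [6 * n_cams camera parameters | 3 * n_pts point coordinates],
    each camera being an angle-axis rotation followed by a translation.
    cam_idx[k] and pt_idx[k] identify the camera and point of observation observed_uv[k].
    """
    cams = params[:6 * n_cams].reshape(n_cams, 6)
    pts = params[6 * n_cams:].reshape(n_pts, 3)
    rot = Rotation.from_rotvec(cams[cam_idx, :3])            # one rotation per observation
    cam_pts = rot.apply(pts[pt_idx]) + cams[cam_idx, 3:]     # world -> camera coordinates
    proj = cam_pts @ K.T
    uv = proj[:, :2] / proj[:, 2:3]
    return (uv - observed_uv).ravel()

# Usage (with x0 stacking initial camera and point parameters):
# result = least_squares(reprojection_residuals, x0, method="trf",
#                        args=(n_cams, n_pts, K, cam_idx, pt_idx, observed_uv))
```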
And S404, integrating the aerial photography and the ground image for dense reconstruction by using the aerial photography and the ground camera pose obtained through the optimization in the step S403, and obtaining a dense model of the indoor scene.
Because the cross-aerial photography and ground view constraint is introduced in the optimization process, and the aerial photography and ground images are fused in the dense reconstruction process, the reconstructed model is more complete and accurate than the model reconstructed by the image from a single source.
In order to verify the scene modeling method fusing aerial photography and ground visual angle images according to the embodiment of the invention, the method of this embodiment is tested on two indoor scene data sets acquired with the aerial and ground data acquisition equipment shown in fig. 13.
1. Data set
Because there are currently almost no public aerial-and-ground image data sets for indoor scenes, two indoor scene data sets were collected in this test for method evaluation. Specifically, a DJI Spark mini aircraft is used to acquire the aerial visual angle data, and a GoPro 4 mounted on a TurtleBot is used to acquire the ground visual angle data; the data acquisition devices are shown in fig. 13, from left to right: the TurtleBot on the ground, the DJI Spark in the air, and the DJI Spark on a desktop. The acquired aerial and ground raw data are videos with a resolution of 1080p and a frame rate of 25 FPS. The two sets of indoor scene data are called Room and Hall, respectively. Some information about the Room and Hall data sets is given in Table 1. Example aerial images and the generated three-dimensional aerial maps of the Room and Hall data sets are shown in fig. 2 and fig. 14, respectively. As can be seen from fig. 2 and fig. 14, the aerial map of the Hall data set is of poorer quality and larger scale than that of the Room data set. However, as the subsequent evaluation shows, the method of the present invention achieves the expected results on both data sets, indicating good robustness and scalability.
TABLE 1
Data set                   Room    Hall
Aerial video length (s)     218     494
Ground video length (s)      61     113
Covered area (m²)            30     130
In addition, the virtual camera pose calculation and robot path planning results on the Room and Hall data sets are shown on the far right side of fig. 2 and 14, respectively. As shown in the figure, the ground plane for calculating the pose of the virtual camera and planning the path of the robot can be successfully detected by the method, and the generated virtual camera and the path of the robot cover an indoor scene more uniformly. 60 and 384 virtual cameras are generated on the Room and Hall data sets respectively by the virtual camera pose calculation method of the invention. The first three columns in FIG. 14 are exemplary aerial images and their corresponding three-dimensional aerial map regions. The fourth column is the entire three-dimensional aerial map. The fifth column is the result of robot path planning and virtual camera pose calculation on the aerial photography map, wherein the ground plane is marked as light gray, the planned path is marked as a line segment, and the virtual camera pose is represented by a pyramid.
2. Adaptive frame extraction results
By the adaptive frame extraction method, 271 and 112 frames are extracted from the aerial and ground videos of the Room dataset, and 721 and 250 frames from the aerial and ground videos of the Hall dataset, respectively. To verify the effectiveness of the frame extraction method, it is compared with equal-interval frame extraction on the aerial video of the Hall dataset. The adaptive frame extraction method of the invention extracts 721 frames from the 494 s, 25 FPS video; for the equal-interval method, one frame is extracted every 17 frames (494 × 25 / 721 ≈ 17), for a total of 730 frames. The video frames obtained by the two extraction methods are then calibrated with the open-source SfM system COLMAP, with the results shown in fig. 15. Left: the COLMAP result for the adaptively extracted video frames, whose video frames are successfully calibrated into a single consistent reconstruction. Middle and right: the COLMAP results for the equal-interval extracted video frames, whose video frames are successfully calibrated but broken into two parts; the middle and right figures correspond to the two rectangular areas in the left figure, and the circled portions in the left and right figures compare the results at the same corner. As can be seen from fig. 15, the video frames extracted by the method of the invention have better connectivity than those extracted at equal intervals, so a consistent aerial map can be reconstructed from them. In addition, the black circles in fig. 15 indicate that, to obtain a more complete aerial map, denser frame extraction is needed where the video passes corners.
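As a minimal sketch of the equal-interval baseline used in this comparison (OpenCV is assumed to be available; the function name and interface are hypothetical, and the adaptive bag-of-words frame extraction itself is not reproduced here):

```python
# Equal-interval frame extraction baseline: keep one frame every `step` frames,
# where `step` is chosen so the total roughly matches a given frame budget.
import cv2


def extract_equal_interval(video_path, target_frame_count):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(1, round(total / target_frame_count))  # e.g. ~494 s * 25 FPS / 721 -> 17
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:          # keep one frame every `step` frames
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```

For the 494 s, 25 FPS Hall aerial video and a 721-frame budget, this corresponds to the interval of 17 frames used in the comparison above.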
3. Ground camera positioning results
To verify the batch camera positioning and the aerial-ground image fusion method, the camera positioning results after batch positioning and after aerial-ground image fusion are compared qualitatively and quantitatively with the COLMAP results. Note that for COLMAP the ground camera poses are not initialized and are calibrated from the images alone, i.e., the camera positioning result obtained from the aerial map in step S300 is not provided to COLMAP as a prior.
The qualitative comparison results are shown in fig. 16. First row: Room dataset results; second row: Hall dataset results; from left to right: the result after aerial-ground image fusion, the result after batch ground camera positioning, and the COLMAP calibration result; the rectangles in the figure indicate erroneous camera poses. As can be seen from fig. 16, for the Room dataset the camera poses obtained by the three compared methods are quite similar, because the scene structure of the Room dataset is relatively simple. For the Hall dataset, the camera trajectory computed by COLMAP is clearly erroneous in the left part of the scene. This is because repeated and weak textures introduce many outliers into the matches between ground images, which causes the incremental SfM system to exhibit obvious scene drift. In contrast, for batch camera positioning, since part of the ground images are already initially positioned with respect to the aerial map, only slight scene drift remains in the positioning result, and the erroneous camera poses are corrected in the subsequent aerial-ground image fusion stage, because the connected aerial-ground feature point tracks are introduced into the global optimization during image fusion. These results show that positioning the ground cameras by fusing aerial and ground images is more robust than using ground images alone.
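For illustration, the following is a minimal sketch of how a single ground frame could be localized against the aerial map by matching it to a synthesized reference image whose keypoints carry known 3D map coordinates, followed by PnP with RANSAC. OpenCV is assumed; all names (ref_desc, ref_points3d) and thresholds (the 0.8 ratio test, the 4-pixel reprojection error) are illustrative choices, not values prescribed by the patent.

```python
# Localize one ground frame in the aerial-map coordinate frame by matching it to a
# synthesized ground-view reference image. ref_desc[i] is the descriptor of a
# reference keypoint whose 3D coordinate in the aerial map is ref_points3d[i].
import cv2
import numpy as np


def localize_ground_frame(ground_img, ref_desc, ref_points3d, K):
    sift = cv2.SIFT_create()
    kps_g, desc_g = sift.detectAndCompute(ground_img, None)

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = []
    for pair in matcher.knnMatch(desc_g, ref_desc, k=2):
        if len(pair) == 2 and pair[0].distance < 0.8 * pair[1].distance:  # ratio test
            good.append(pair[0])
    if len(good) < 6:
        return None                                 # too few 2D-3D correspondences

    pts2d = np.float32([kps_g[m.queryIdx].pt for m in good])
    pts3d = np.float32([ref_points3d[m.trainIdx] for m in good])
    ok, rvec, tvec, _ = cv2.solvePnPRansac(pts3d, pts2d, K, None,
                                           reprojectionError=4.0)
    return (rvec, tvec) if ok else None
```

In the batch positioning stage discussed above, single-frame localizations of this kind against the synthesized reference images provide the ground cameras that are initially anchored to the aerial map, and the remaining ground images are registered relative to them.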
4. Indoor scene reconstruction results
Finally, the indoor scene reconstruction algorithm is evaluated qualitatively and quantitatively. This test compares the indoor reconstruction results of the invention with the results of reconstruction using only aerial or only ground images. The qualitative comparison results are shown in fig. 17. First column: Room dataset results; second column: enlargement of the rectangular area in the first column; third column: Hall dataset results; fourth column: enlargement of the rectangular area in the third column; from top to bottom: the results of fusing aerial and ground images, of using only ground images, and of using only aerial images. It should be noted that: (1) for the indoor reconstruction algorithm of the invention, the camera poses used are those obtained after fusing the aerial and ground images; (2) for the method using only ground images, the camera poses used are those obtained after batch camera positioning; (3) for the method using only aerial images, the camera poses used are those estimated by SfM. As can be seen from fig. 17, although some regions are inevitably missing from the reconstruction results due to occlusion and weak texture, the indoor reconstruction obtained by fusing aerial and ground images is more complete than that obtained from images of a single source.
The scene modeling system fusing the aerial photography and the ground visual angle images comprises an aerial photography map building module, a synthetic image acquisition module, a visual angle image set acquisition module and an indoor scene model acquisition module;
the aerial photography map building module is configured to obtain an aerial photography view angle image of an indoor scene to be modeled and build an aerial photography map;
the synthetic image acquisition module is configured to acquire a synthetic image by a method of synthesizing a ground visual angle reference image from the aerial map based on the aerial map;
the visual angle image set acquisition module is configured to acquire a ground visual angle image set through ground visual angle images acquired by a ground camera;
and the indoor scene model acquisition module is configured to fuse the aerial photography view angle image and the ground view angle image based on the synthetic image to acquire an indoor scene model.
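Purely as an organizational sketch (the class and method names are illustrative and not part of the claimed system), the four modules could be arranged as follows, with each method standing in for the processing described in steps S100 to S400:

```python
# Schematic organization of the four modules of the system; bodies are placeholders.
class AerialMapBuilder:
    def build(self, aerial_video_frames):
        """S100: build the aerial map from the aerial view angle images."""
        raise NotImplementedError


class SyntheticImageGenerator:
    def synthesize(self, aerial_map):
        """S200: synthesize ground view angle reference images from the aerial map."""
        raise NotImplementedError


class GroundImageCollector:
    def collect(self, ground_video_frames):
        """S300: assemble the ground view angle image set."""
        raise NotImplementedError


class IndoorSceneModelBuilder:
    def fuse(self, aerial_map, reference_images, ground_images):
        """S400: fuse aerial and ground images via cross-view constraints."""
        raise NotImplementedError


class SceneModelingSystem:
    def __init__(self):
        self.aerial_map_builder = AerialMapBuilder()
        self.synthetic_image_generator = SyntheticImageGenerator()
        self.ground_image_collector = GroundImageCollector()
        self.scene_model_builder = IndoorSceneModelBuilder()

    def run(self, aerial_video_frames, ground_video_frames):
        aerial_map = self.aerial_map_builder.build(aerial_video_frames)
        references = self.synthetic_image_generator.synthesize(aerial_map)
        ground_images = self.ground_image_collector.collect(ground_video_frames)
        return self.scene_model_builder.fuse(aerial_map, references, ground_images)
```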
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.
It should be noted that, the scene modeling system fusing aerial photography and ground perspective images provided in the above embodiment is only illustrated by the division of the above functional modules, and in practical applications, the above functions may be allocated to different functional modules according to needs, that is, the modules or steps in the embodiment of the present invention are further decomposed or combined, for example, the modules in the above embodiment may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the above described functions. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
A storage device according to a third embodiment of the present invention stores a plurality of programs, which are suitable for being loaded and executed by a processor to implement the above-mentioned scene modeling method for merging aerial and ground perspective images.
A processing apparatus according to a fourth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is suitable for being loaded and executed by a processor to implement the scene modeling method for fusing aerial and ground perspective images as described above.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Those of skill in the art will appreciate that the various illustrative modules and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both, and that programs corresponding to the software modules and method steps may be stored in random access memory (RAM), read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (7)

1. A scene modeling method fusing aerial photography and ground perspective images is characterized by comprising the following steps:
s100, acquiring an aerial photography view angle image of an indoor scene to be modeled, and constructing an aerial photography map;
step S200, acquiring a synthetic image by a method of synthesizing a ground visual angle reference image from the aerial map based on the aerial map;
the method for synthesizing the ground visual angle reference image from the aerial map comprises: calculating virtual camera poses based on the aerial map, and specifically comprises the following steps:
firstly, obtaining the visible mesh of each aerial camera and each virtual camera, and then projecting the visible mesh of each virtual camera onto that camera to form a set of two-dimensional triangles; during virtual image synthesis, for a given two-dimensional triangle in a virtual image, determining, based on visibility, sharpness and consistency factors, which aerial image is warped to fill that region; the image synthesis problem is formulated as the following multi-label optimization problem:
E(l) = Σ_{t_i ∈ T} D_i(l_i) + Σ_{(t_i, t_j) ∈ N} V_i(l_i, l_j)
wherein E(l) is the energy function of the graph cut process; T is the set of two-dimensional triangles obtained by projecting the three-dimensional mesh visible to the virtual camera, and t_i is the i-th triangle; N is the set of common edges between the triangles in the projected two-dimensional triangle set; l_i is the aerial image sequence number assigned to t_i; D_i(l_i) is the data term; V_i(l_i, l_j) is the smoothness term;
when the space patch corresponding to t_i is visible in the l_i-th aerial image, the data term is D_i(l_i) = s_{l_i} / a_i(l_i); otherwise D_i(l_i) = α; wherein s_{l_i} is the median of the scales of the local features in the l_i-th aerial image, a_i(l_i) is the projected area of the space patch corresponding to t_i in the l_i-th aerial image, and α = 10^4;
when l_i = l_j, the smoothness term V_i(l_i, l_j) = 0; otherwise V_i(l_i, l_j) = 1;
and acquiring, through a graph cut algorithm, the synthesized ground visual angle reference image based on the aerial map;
step S300, acquiring a ground visual angle image set through a ground visual angle image acquired by a ground camera;
step S400, generating cross-view constraints based on the synthesized images, and fusing the aerial photography visual angle images and the ground visual angle images based on the cross-view constraints to obtain an indoor scene model, which specifically comprises the following steps:
acquiring, for each image in the ground visual angle image set, the position in the aerial map of the ground camera corresponding to that image;
connecting the matching points between the ground visual angle images and the synthesized images into the original aerial and ground feature point tracks to generate the cross-view constraints;
optimizing the aerial map and the ground visual angle image point cloud through bundle adjustment;
and performing dense reconstruction with the aerial images and the ground visual angle images using the aerial and ground camera poses, to obtain a dense model of the indoor scene, the dense model of the indoor scene being the indoor scene model.
2. The scene modeling method fusing the aerial photography and the ground perspective image according to claim 1, wherein in step S100, "the aerial photography perspective image of the indoor scene to be modeled is acquired, and the aerial photography map is constructed", and the method is as follows:
extracting image frames of the aerial photography visual angle video of the indoor scene by adopting a self-adaptive video frame extraction method based on a bag-of-words model to obtain an aerial photography visual angle image set of the indoor scene;
and constructing an aerial map by an image modeling method based on the aerial visual angle image set.
3. The scene modeling method for fusing aerial photography and ground perspective images according to claim 1, wherein in step S300, "the ground perspective image collected by the ground camera is used to obtain the ground perspective image set", and the method comprises:
the ground robot continuously collects ground visual angle videos through a ground camera arranged on the ground robot based on the planned path;
and extracting image frames of the ground visual angle video of the indoor scene by adopting a self-adaptive video frame extraction method based on the bag-of-words model to obtain a ground visual angle image set of the indoor scene.
4. The scene modeling method fusing aerial photography and ground visual angle images according to claim 3, wherein, in the process in which the ground robot continuously collects the ground visual angle video through the ground camera mounted on it based on the planned path, the positioning method comprises initial robot positioning and mobile robot positioning;
the initial robot positioning method comprises: acquiring the first frame of the video collected by the ground camera, obtaining the initial position of the robot in the aerial map, and taking this position as the starting point of the subsequent movement of the robot;
the mobile robot positioning method comprises: roughly positioning the robot at each moment based on the initial position and the motion data of the robot; obtaining the position of the robot in the aerial map at the current moment by matching the video frame collected at the current moment with the synthesized image; and revising the roughly positioned position information according to this position.
5. A scene modeling system fusing aerial photography and ground visual angle images is characterized by comprising an aerial photography map building module, a synthetic image acquisition module, a visual angle image set acquisition module and an indoor scene model acquisition module;
the aerial photography map building module is configured to obtain an aerial photography view angle image of an indoor scene to be modeled and build an aerial photography map;
the synthetic image acquisition module is configured to acquire a synthetic image by a method of synthesizing a ground visual angle reference image from the aerial map based on the aerial map;
the method for synthesizing the ground visual angle reference image from the aerial map comprises: calculating virtual camera poses based on the aerial map, and specifically comprises the following steps:
firstly, obtaining the visible mesh of each aerial camera and each virtual camera, and then projecting the visible mesh of each virtual camera onto that camera to form a set of two-dimensional triangles; during virtual image synthesis, for a given two-dimensional triangle in a virtual image, determining, based on visibility, sharpness and consistency factors, which aerial image is warped to fill that region; the image synthesis problem is formulated as the following multi-label optimization problem:
E(l) = Σ_{t_i ∈ T} D_i(l_i) + Σ_{(t_i, t_j) ∈ N} V_i(l_i, l_j)
wherein E(l) is the energy function of the graph cut process; T is the set of two-dimensional triangles obtained by projecting the three-dimensional mesh visible to the virtual camera, and t_i is the i-th triangle; N is the set of common edges between the triangles in the projected two-dimensional triangle set; l_i is the aerial image sequence number assigned to t_i; D_i(l_i) is the data term; V_i(l_i, l_j) is the smoothness term;
when the space patch corresponding to t_i is visible in the l_i-th aerial image, the data term is D_i(l_i) = s_{l_i} / a_i(l_i); otherwise D_i(l_i) = α; wherein s_{l_i} is the median of the scales of the local features in the l_i-th aerial image, a_i(l_i) is the projected area of the space patch corresponding to t_i in the l_i-th aerial image, and α = 10^4;
when l_i = l_j, the smoothness term V_i(l_i, l_j) = 0; otherwise V_i(l_i, l_j) = 1;
and acquiring, through a graph cut algorithm, the synthesized ground visual angle reference image based on the aerial map;
the visual angle image set acquisition module is configured to acquire a ground visual angle image set through ground visual angle images acquired by a ground camera;
the indoor scene model acquisition module is configured to generate cross-view constraints based on the synthesized images and to fuse the aerial photography visual angle images and the ground visual angle images based on the cross-view constraints to obtain the indoor scene model, which specifically comprises the following steps:
acquiring, for each image in the ground visual angle image set, the position in the aerial map of the ground camera corresponding to that image;
connecting the matching points between the ground visual angle images and the synthesized images into the original aerial and ground feature point tracks to generate the cross-view constraints;
optimizing the aerial map and the ground visual angle image point cloud through bundle adjustment;
and performing dense reconstruction with the aerial images and the ground visual angle images using the aerial and ground camera poses, to obtain a dense model of the indoor scene, the dense model of the indoor scene being the indoor scene model.
6. A storage device having stored therein a plurality of programs, wherein the programs are adapted to be loaded and executed by a processor to implement the method of scene modeling with fusion of aerial and ground perspective images of any of claims 1-4.
7. A processing device comprising a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; characterized in that said program is adapted to be loaded and executed by a processor to implement a method of scene modeling fusing aerial and ground perspective images according to any of claims 1 to 4.
CN201910502762.4A 2019-06-11 2019-06-11 Scene modeling method, system and device fusing aerial photography and ground visual angle images Active CN110223380B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910502762.4A CN110223380B (en) 2019-06-11 2019-06-11 Scene modeling method, system and device fusing aerial photography and ground visual angle images

Publications (2)

Publication Number Publication Date
CN110223380A CN110223380A (en) 2019-09-10
CN110223380B true CN110223380B (en) 2021-04-23

Family

ID=67816534

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910502762.4A Active CN110223380B (en) 2019-06-11 2019-06-11 Scene modeling method, system and device fusing aerial photography and ground visual angle images

Country Status (1)

Country Link
CN (1) CN110223380B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112815923B (en) * 2019-11-15 2022-12-30 华为技术有限公司 Visual positioning method and device
CN111723681B (en) * 2020-05-28 2024-03-08 北京三快在线科技有限公司 Indoor road network generation method and device, storage medium and electronic equipment
CN112539758B (en) * 2020-12-16 2023-06-02 中铁大桥勘测设计院集团有限公司 Ground line drawing method and system in aerial video
CN112651881B (en) * 2020-12-30 2023-08-01 北京百度网讯科技有限公司 Image synthesizing method, apparatus, device, storage medium, and program product
CN115619858A (en) * 2021-07-15 2023-01-17 华为技术有限公司 Object reconstruction method and related equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103198524A (en) * 2013-04-27 2013-07-10 清华大学 Three-dimensional reconstruction method for large-scale outdoor scene
CN104205826A (en) * 2012-04-03 2014-12-10 三星泰科威株式会社 Apparatus and method for reconstructing high density three-dimensional image
CN105843223A (en) * 2016-03-23 2016-08-10 东南大学 Mobile robot three-dimensional mapping and obstacle avoidance method based on space bag of words model
CN109862275A (en) * 2019-03-28 2019-06-07 Oppo广东移动通信有限公司 Electronic equipment and mobile platform

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9129432B2 (en) * 2010-01-28 2015-09-08 The Hong Kong University Of Science And Technology Image-based procedural remodeling of buildings
US8761457B1 (en) * 2013-11-27 2014-06-24 Google Inc. Aligning ground based images and aerial imagery
CN104021586A (en) * 2014-05-05 2014-09-03 深圳市城市管理监督指挥中心 Air-ground integrated city ecological civilization managing system and method based on Beidou positioning
JP6445808B2 (en) * 2014-08-26 2018-12-26 三菱重工業株式会社 Image display system
WO2018137126A1 (en) * 2017-01-24 2018-08-02 深圳大学 Method and device for generating static video abstract
CN107356230B (en) * 2017-07-12 2020-10-27 深圳市武测空间信息有限公司 Digital mapping method and system based on live-action three-dimensional model
CN109472865B (en) * 2018-09-27 2022-03-04 北京空间机电研究所 Free measurable panoramic reproduction method based on image model drawing

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332383A (en) * 2022-03-17 2022-04-12 青岛市勘察测绘研究院 Scene three-dimensional modeling method and device based on panoramic video
CN114332383B (en) * 2022-03-17 2022-06-28 青岛市勘察测绘研究院 Scene three-dimensional modeling method and device based on panoramic video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant