CN111433819A - Target scene three-dimensional reconstruction method and system and unmanned aerial vehicle - Google Patents


Info

Publication number
CN111433819A
CN111433819A (application CN201880073770.5A)
Authority
CN
China
Prior art keywords
frame
target
target frame
image
matching cost
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201880073770.5A
Other languages
Chinese (zh)
Inventor
杨志华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SZ DJI Technology Co Ltd
Original Assignee
SZ DJI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SZ DJI Technology Co Ltd filed Critical SZ DJI Technology Co Ltd
Publication of CN111433819A publication Critical patent/CN111433819A/en
Pending legal-status Critical Current


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00: Three-dimensional [3D] modelling, e.g. data description of 3D objects

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A method, a system, and an unmanned aerial vehicle for three-dimensional reconstruction of a target scene are provided. The method comprises: acquiring an image sequence of the target scene, the image sequence comprising a plurality of image frames that are consecutive in time sequence (S201); obtaining a target frame and a reference frame from the plurality of time-sequentially consecutive image frames, and obtaining a depth map of the target frame based on the reference frame (S202); and fusing the depth map of the target frame to obtain a three-dimensional model of the target scene (S203). The method realizes monocular-vision-based three-dimensional reconstruction of the target scene in an unmanned aerial vehicle aerial photography scenario. It does not rely on an expensive binocular vision system, is not limited by the measurement range of a depth sensor, and can meet the requirements of three-dimensional reconstruction of a target scene in unmanned aerial vehicle aerial photography.

Description

Target scene three-dimensional reconstruction method and system and unmanned aerial vehicle
Technical Field
Embodiments of the present invention relate to the technical field of unmanned aerial vehicles, and in particular to a method and a system for three-dimensional reconstruction of a target scene and an unmanned aerial vehicle.
Background
With the continuous development of image processing technology, three-dimensional reconstruction of a photographed scene from an image sequence has become an active topic in the fields of computer vision and photogrammetry. Three-dimensional reconstruction based on image sequences generally falls into three categories: reconstruction based on color-and-depth (RGB-D) data, binocular-based reconstruction, and monocular-based reconstruction. Reconstruction based on RGB-D data is limited by the measurement range of the depth sensor and can generally only be used in relatively confined indoor scenes. Binocular-based reconstruction depends on a binocular vision system and its hardware cost is high. Monocular-based reconstruction is therefore of great significance for building a three-dimensional model of a photographed scene.
Monocular-based three-dimensional reconstruction uses a single camera: the camera is moved, a depth map is estimated from the apparent motion of objects across different images, and the depth maps are then fused to complete the reconstruction. Owing to the particular conditions of unmanned aerial vehicle aerial photography, existing monocular-based reconstruction methods produce poor depth-map results and large reconstruction errors in such scenes. In summary, a method for three-dimensional reconstruction of a target scene that meets the requirements of unmanned aerial vehicle aerial photography is needed.
Disclosure of Invention
Embodiments of the present invention provide a method and system for three-dimensional reconstruction of a target scene, and an unmanned aerial vehicle, to solve the problem that existing methods cannot meet the requirements of three-dimensional reconstruction of a target scene in an unmanned aerial vehicle aerial photography scenario.
In a first aspect, an embodiment of the present invention provides a method for three-dimensional reconstruction of a target scene, including:
acquiring an image sequence of a target scene, wherein the image sequence comprises a plurality of image frames which are continuous in time sequence;
obtaining a target frame and a reference frame according to a plurality of image frames which are continuous in the time sequence, and obtaining a depth map of the target frame based on the reference frame;
and fusing the depth map of the target frame to obtain a three-dimensional model of the target scene.
In a second aspect, an embodiment of the present invention provides a target scene three-dimensional reconstruction system, including: a processor and a memory;
the memory for storing program code;
the processor, invoking the program code, when executed, is configured to:
acquiring an image sequence of a target scene, wherein the image sequence comprises a plurality of image frames which are continuous in time sequence;
obtaining a target frame and a reference frame according to a plurality of image frames which are continuous in the time sequence, and obtaining a depth map of the target frame based on the reference frame;
and fusing the depth map of the target frame to obtain a three-dimensional model of the target scene.
In a third aspect, an embodiment of the present invention provides an unmanned aerial vehicle, including: a processor;
the unmanned aerial vehicle is provided with a shooting device, and the shooting device is used for shooting a target scene;
the processor is configured to perform at least one of,
acquiring an image sequence of a target scene, wherein the image sequence comprises a plurality of image frames which are continuous in time sequence;
obtaining a target frame and a reference frame according to a plurality of image frames which are continuous in the time sequence, and obtaining a depth map of the target frame based on the reference frame;
and fusing the depth map of the target frame to obtain a three-dimensional model of the target scene.
In a fourth aspect, an embodiment of the present invention provides an apparatus (e.g., a chip, an integrated circuit, etc.) for three-dimensional reconstruction of a target scene, including: a memory and a processor. The memory is used for storing codes for executing the target scene three-dimensional reconstruction method. The processor is configured to call the code stored in the memory, and execute the method for three-dimensional reconstruction of a target scene according to the first aspect of the present invention.
In a fifth aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored, where the computer program includes at least one piece of code, where the at least one piece of code is executable by a computer to control the computer to perform the method for three-dimensional reconstruction of a target scene according to the first aspect.
In a sixth aspect, an embodiment of the present invention provides a computer program, which is configured to, when executed by a computer, implement the method for three-dimensional reconstruction of a target scene according to the first aspect.
According to the method and system for three-dimensional reconstruction of a target scene and the unmanned aerial vehicle provided by the embodiments of the present invention, an image sequence of the target scene is acquired, the image sequence comprising a plurality of image frames that are consecutive in time sequence; a target frame and a reference frame are obtained from these image frames; a depth map of the target frame is obtained based on the reference frame; and the depth map of the target frame is fused to obtain a three-dimensional model of the target scene. Monocular-vision-based three-dimensional reconstruction of the target scene in an unmanned aerial vehicle aerial photography scenario is thereby realized. The method provided by this embodiment does not rely on an expensive binocular vision system, is not limited by the measurement range of a depth sensor, and can meet the requirements of three-dimensional reconstruction of a target scene in unmanned aerial vehicle aerial photography.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic architecture diagram of an unmanned flight system provided by an embodiment of the present invention;
FIG. 2 is a flowchart of an embodiment of a method for three-dimensional reconstruction of a target scene according to the present invention;
FIG. 3 is a schematic diagram illustrating reference frame selection in an embodiment of a method for three-dimensional reconstruction of a target scene according to the present invention;
FIG. 4 is a schematic block diagram of an embodiment of a method for three-dimensional reconstruction of a target scene according to the present invention;
FIG. 5 is a schematic structural diagram of an embodiment of a three-dimensional reconstruction system of a target scene according to the present invention;
FIG. 6 is a schematic structural diagram of an embodiment of the unmanned aerial vehicle provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that when an element is referred to as being "secured to" another element, it can be directly on the other element or intervening elements may also be present. When a component is referred to as being "connected" to another component, it can be directly connected to the other component or intervening components may also be present.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Some embodiments of the invention are described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Embodiments of the present invention provide a method and system for three-dimensional reconstruction of a target scene, and an unmanned aerial vehicle. The drone may be, for example, a rotorcraft, such as a multi-rotor aircraft propelled through the air by a plurality of propulsion devices; embodiments of the invention are not limited in this regard.
Fig. 1 is a schematic architecture diagram of an unmanned aerial vehicle system provided by an embodiment of the invention. The present embodiment is described by taking a rotor unmanned aerial vehicle as an example.
The unmanned flight system 100 can include a drone 110, a display device 130, and a control terminal 140. The drone 110 may include, among other things, a power system 150, a flight control system 160, a frame, and a pan-tilt 120 carried on the frame. The drone 110 may be in wireless communication with the control terminal 140 and the display device 130.
The airframe may include a fuselage and a foot rest (also referred to as a landing gear). The fuselage may include a central frame and one or more arms connected to the central frame, the one or more arms extending radially from the central frame. The foot rest is connected to the fuselage and supports the drone 110 when it lands.
The power system 150 may include one or more electronic governors (commonly called electronic speed controllers) 151, one or more propellers 153, and one or more motors 152 corresponding to the one or more propellers 153, wherein a motor 152 is connected between an electronic governor 151 and a propeller 153, and the motors 152 and propellers 153 are disposed on the arms of the drone 110. The electronic governor 151 is configured to receive a drive signal generated by the flight control system 160 and provide a drive current to the motor 152 based on the drive signal to control the rotational speed of the motor 152. The motor 152 drives the propeller in rotation, thereby providing power for the flight of the drone 110; this power enables the drone 110 to achieve one or more degrees of freedom of motion. In certain embodiments, the drone 110 may rotate about one or more axes of rotation. For example, the rotation axes may include a roll axis, a yaw axis, and a pitch axis. It should be understood that the motor 152 may be a DC motor or an AC motor, and may be a brushless motor or a brushed motor.
Flight control system 160 may include a flight controller 161 and a sensing system 162. The sensing system 162 is used to measure attitude information of the drone, i.e., position information and state information of the drone 110 in space, such as three-dimensional position, three-dimensional angle, three-dimensional velocity, three-dimensional acceleration, three-dimensional angular velocity, and the like. The sensing system 162 may include, for example, at least one of a gyroscope, an ultrasonic sensor, an electronic compass, an Inertial Measurement Unit (IMU), a vision sensor, a global navigation satellite system, and a barometer. For example, the Global navigation satellite System may be a Global Positioning System (GPS). The flight controller 161 is used to control the flight of the drone 110, for example, the flight of the drone 110 may be controlled according to attitude information measured by the sensing system 162. It should be understood that the flight controller 161 may control the drone 110 according to preprogrammed instructions, or may control the drone 110 in response to one or more control instructions from the control terminal 140.
The pan/tilt head 120 may include a motor 122. The pan/tilt head is used to carry the photographing device 123. Flight controller 161 may control the movement of pan/tilt head 120 via motor 122. Optionally, as another embodiment, the pan/tilt head 120 may further include a controller for controlling the movement of the pan/tilt head 120 by controlling the motor 122. It should be understood that the pan/tilt head 120 may be separate from the drone 110, or may be part of the drone 110. It should be understood that the motor 122 may be a dc motor or an ac motor. The motor 122 may be a brushless motor or a brush motor. It should also be understood that the pan/tilt head may be located at the top of the drone, as well as at the bottom of the drone.
The photographing device 123 may be, for example, a device for capturing an image such as a camera or a video camera, and the photographing device 123 may communicate with the flight controller and perform photographing under the control of the flight controller. The image capturing Device 123 of this embodiment at least includes a photosensitive element, such as a Complementary Metal Oxide Semiconductor (CMOS) sensor or a Charge-coupled Device (CCD) sensor. It can be understood that the camera 123 may also be directly fixed to the drone 110, such that the pan/tilt head 120 may be omitted.
The display device 130 is located at the ground end of the unmanned aerial vehicle system 100, can communicate with the unmanned aerial vehicle 110 in a wireless manner, and can be used for displaying attitude information of the unmanned aerial vehicle 110. In addition, an image taken by the imaging device may also be displayed on the display apparatus 130. It should be understood that the display device 130 may be a stand-alone device or may be integrated into the control terminal 140.
The control terminal 140 is located at the ground end of the unmanned aerial vehicle system 100, and can communicate with the unmanned aerial vehicle 110 in a wireless manner, so as to remotely control the unmanned aerial vehicle 110.
In addition, the unmanned aerial vehicle 110 may also have a speaker (not shown in the figure) mounted thereon, and the speaker is used for playing audio files, and the speaker may be directly fixed on the unmanned aerial vehicle 110, or may be mounted on the cradle head 120.
The photographing device 123 in this embodiment may be, for example, a monocular camera for photographing a target scene to acquire an image sequence of the target scene. The method for three-dimensional reconstruction of the target scene provided in the following embodiments may be executed by the flight controller 161; for example, the flight controller 161 acquires the image sequence of the target scene through the photographing device 123 and performs three-dimensional reconstruction of the target scene, which can be used for obstacle avoidance during flight. The method may also be executed by the control terminal 140 at the ground end; for example, the unmanned aerial vehicle transmits the image sequence of the target scene acquired by the photographing device 123 to the control terminal 140 through image transmission technology, and the control terminal 140 completes the three-dimensional reconstruction of the target scene. The method may also be executed by a cloud server (not shown in the figure); for example, the unmanned aerial vehicle transmits the image sequence of the target scene acquired by the photographing device 123 to the cloud server through image transmission technology, and the cloud server completes the three-dimensional reconstruction of the target scene.
It should be understood that the above-mentioned nomenclature for the components of the unmanned flight system is for identification purposes only, and should not be construed as limiting embodiments of the present invention.
Fig. 2 is a flowchart of an embodiment of a method for three-dimensional reconstruction of a target scene according to the present invention. As shown in fig. 2, the method provided by this embodiment may include:
s201, acquiring an image sequence of a target scene, wherein the image sequence comprises a plurality of image frames which are continuous in time sequence.
In this embodiment, for example, an unmanned aerial vehicle equipped with a monocular camera device may be used to capture a target scene to obtain an image sequence of the target scene.
The target scene is an object needing three-dimensional reconstruction. In this embodiment, after the target scene is determined, a flight route can be planned for the unmanned aerial vehicle, and the flight speed and the shooting frame rate are set to acquire an image sequence of the target scene, or the shooting place can be set, and when the unmanned aerial vehicle flies to a preset shooting place, shooting is performed.
The image sequence of the target scene acquired in this embodiment includes a plurality of image frames that are consecutive in time sequence.
S202, obtaining a target frame and a reference frame according to a plurality of image frames which are continuous in time sequence, and obtaining a depth map of the target frame based on the reference frame.
In this embodiment, after the image sequence of the target scene is acquired, the target frame and the reference frame need to be determined from the plurality of time-sequentially consecutive image frames in order to realize three-dimensional reconstruction of the target scene. The target frame is an image frame whose depth needs to be recovered for the three-dimensional reconstruction; the reference frame is an image frame that provides data such as depth information for the target frame and has temporal correlation and pixel correlation with the target frame.
Alternatively, the target frame in the present embodiment may include one frame of a plurality of image frames that are consecutive in time sequence.
Alternatively, the reference frame in this embodiment may include a frame having overlapping pixels with the target frame.
In this embodiment, for example, feature extraction, feature point matching, and pose estimation may be performed on the acquired time-sequentially consecutive image frames to determine the target frame and the reference frame. To improve accuracy, features with rotation invariance may be selected, such as Scale-Invariant Feature Transform (SIFT) or Speeded-Up Robust Features (SURF). The pose of each image frame at the time of shooting may be obtained from sensors mounted on the unmanned aerial vehicle, such as an odometer, a gyroscope, or an IMU.
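As an illustration only (not the patented implementation), the following sketch shows how rotation-invariant features could be extracted and matched between two candidate frames with OpenCV; the function name and the ratio-test threshold are assumptions for the example.

```python
import cv2

def match_features(target_img, candidate_img, ratio=0.7):
    """Match SIFT features between a target frame and a candidate reference frame."""
    sift = cv2.SIFT_create()
    kp1, desc1 = sift.detectAndCompute(target_img, None)
    kp2, desc2 = sift.detectAndCompute(candidate_img, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(desc1, desc2, k=2)
    # Lowe's ratio test keeps only unambiguous correspondences
    good = [m for m, n in knn if m.distance < ratio * n.distance]
    return kp1, kp2, good
```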
In this embodiment, after the target frame and the reference frame are determined, the depth map corresponding to the target frame may be obtained based on the image data of the reference frame according to the feature point matching between the target frame and the reference frame and the knowledge of epipolar geometry.
And S203, fusing the depth map of the target frame to obtain a three-dimensional model of the target scene.
In this embodiment, after the depth map of the target frame is obtained, the depth map is converted into a corresponding three-dimensional point cloud according to the depth value and position information of each pixel in the target frame, and three-dimensional reconstruction of the target scene is then performed from the three-dimensional point cloud.
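A minimal sketch of the depth-map-to-point-cloud conversion under a standard pinhole camera model is given below; the intrinsics fx, fy, cx, cy are assumed inputs, not values specified by this disclosure.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H x W, metres) into camera-frame 3D points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx          # back-project along the camera x axis
    y = (v - cy) * z / fy          # back-project along the camera y axis
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop pixels with no valid depth
```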
In the method for three-dimensional reconstruction of a target scene provided by this embodiment, an image sequence of the target scene is acquired, the image sequence comprising a plurality of image frames that are consecutive in time sequence; a target frame and a reference frame are obtained from these image frames; a depth map of the target frame is obtained based on the reference frame; and the depth map of the target frame is fused to obtain a three-dimensional model of the target scene. Monocular-vision-based three-dimensional reconstruction of the target scene in an unmanned aerial vehicle aerial photography scenario is thereby realized. The method does not rely on an expensive binocular vision system, is not limited by the measurement range of a depth sensor, and can meet the requirements of three-dimensional reconstruction of a target scene in unmanned aerial vehicle aerial photography. Of course, the use of a monocular imaging device in these embodiments does not mean that the method is inapplicable to a binocular imaging device; in fact, the solutions described herein apply to both binocular and monocular imaging devices.
On the basis of the foregoing embodiment, in order to obtain a more accurate depth map of the target frame to improve the accuracy of three-dimensional reconstruction of the target scene, in the method for three-dimensional reconstruction of the target scene provided by this embodiment, the reference frame may include at least a first image frame and a second image frame. Wherein a first image frame temporally precedes the target frame and a second image frame temporally follows the target frame.
When taking aerial photographs, the unmanned aerial vehicle flies along a planned route. As it does so, a substantial portion of the area in the current image frame was not present in previously captured image frames. That is, if the reference frame only includes image frames captured before the current image frame, and the depth map of the current frame is determined from such a reference frame, a considerable number of regions will have no disparity solution, and the depth map will inevitably contain a large number of invalid regions.
Therefore, in order to avoid regions in the target frame having no corresponding matching region in the reference frame, which would invalidate the corresponding part of the depth map, the reference frame in this embodiment includes both a first image frame located before the target frame in time sequence and a second image frame located after the target frame in time sequence. This improves the overlap rate between the target frame and the reference frame, reduces the regions for which the disparity has no solution, and improves the accuracy of the depth map of the target frame obtained based on the reference frame.
Optionally, if the target frame is the Nth frame, the first image frame is the (N-1)th frame and the second image frame is the (N+1)th frame; that is, the reference frame includes the two frames immediately before and after the target frame. For example, if the overlap rate between two adjacent frames in an aerial survey is 70% and the reference frame only includes the image frame before the target frame, then at least 30% of the area in the target frame has no disparity solution. With the reference frame selection strategy provided by this embodiment, every region in the target frame can find a matching region in the reference frame, which avoids unsolved disparities and improves the accuracy of the depth map of the target frame.
Optionally, if the target frame is an nth frame, the first image frame may include a preset number of image frames before the nth frame, and the second image frame may include a preset number of image frames after the nth frame.
Optionally, if the target frame is an nth frame, the first image frame may be one of a preset number of image frames before the nth frame, and the second image frame may be one of a preset number of image frames after the nth frame.
On the basis of any of the foregoing embodiments, in order to improve the reliability of the depth map of the target frame and of the three-dimensional reconstruction of the target scene, in the method provided by this embodiment the reference frame may include at least a third image frame, whose epipolar direction is not parallel to that of the target frame.
The epipolar lines in this embodiment are epipolar lines in the sense of epipolar geometry, i.e., the intersection of the epipolar plane with the image plane. That the epipolar direction of the third image frame is not parallel to that of the target frame means that the first intersection line of the epipolar plane with the third image frame is not parallel to the second intersection line of the epipolar plane with the target frame.
When repeated textures exist in the target frame, if the epipolar directions of the target frame and the reference frame are parallel, the repeated texture is distributed along parallel epipolar lines, which reduces the reliability of the depth map in that region. Selecting as a reference frame a third image frame whose epipolar direction is not parallel to that of the target frame avoids repeated textures being distributed along parallel epipolar lines and improves the reliability of the depth map.
Optionally, the third image frame may include an image frame in a flight strip adjacent to the target frame that has overlapping pixels with the target frame.
Optionally, the third image frame may be the image frame with the highest overlap rate with the target frame in the flight strip adjacent to the target frame.
The following describes the reference frame selection method of an embodiment of the present invention with a specific example. Fig. 3 is a schematic diagram of reference frame selection in an embodiment of the target scene three-dimensional reconstruction method provided by the present invention. As shown in fig. 3, the solid line represents the flight path of the unmanned aerial vehicle, which covers the target scene; the arrows indicate the flight direction; and each black circle and black square on the flight path marks a position at which the photographing device of the drone captures an image, i.e., corresponds to one image frame of the target scene. As the drone flies along the flight path, the photographing device mounted on it, such as a monocular camera, acquires the image sequence of the target scene, which contains a plurality of time-sequentially consecutive image frames. M-1, M, M+1, N-1, N, and N+1 in fig. 3 indicate the frame numbers of image frames, where N and M are natural numbers; this embodiment does not limit their specific values.
If the nth frame represented by the black square is the target frame, in one possible implementation, the reference frames may include the nth-1 frame and the (N + 1) th frame shown in the figure.
If the nth frame represented by the black square is the target frame, in another possible implementation, the reference frame may include the mth frame shown in the figure.
If the nth frame represented by the black square is the target frame, in another possible implementation manner, the reference frame may include the mth frame, the nth-1 frame, and the (N + 1) th frame shown in the figure, i.e., the image frame included in the dashed circle in fig. 3.
It will be appreciated that the reference frames may also include more image frames, for example the (M-1)th frame, the (M+1)th frame, the (N-2)th frame, and so on. In a specific implementation, the overlap rate between the target frame and the reference frames and the computation speed can be weighed together when making the selection, as in the sketch below.
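The following sketch illustrates, under the assumptions described above, one possible way to collect reference frames for a target frame N: its temporal neighbours N-1 and N+1 plus the most-overlapping frame from the adjacent flight strip. The overlap() helper is a hypothetical placeholder, not an API defined by this disclosure.

```python
def select_reference_frames(frames, n, adjacent_strip_indices, overlap):
    """Return indices of reference frames for target frame index n."""
    refs = []
    if n - 1 >= 0:
        refs.append(n - 1)               # first image frame: before the target frame
    if n + 1 < len(frames):
        refs.append(n + 1)               # second image frame: after the target frame
    if adjacent_strip_indices:
        # third image frame: highest overlap with the target frame in the adjacent strip
        refs.append(max(adjacent_strip_indices, key=lambda m: overlap(n, m)))
    return refs
```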
In some embodiments, one implementation of obtaining the depth map of the target frame based on the reference frame may be: obtaining the depth map of the target frame according to the parallax between the target frame and the reference frame.
In this embodiment, the depth map of the target frame may be obtained from the parallax of the same object between the target frame and the reference frame.
In some embodiments, one implementation of obtaining a depth map of a target frame based on a reference frame may be: determining the matching cost corresponding to the target frame according to the reference frame; and determining the depth map of the target frame according to the matching cost corresponding to the target frame.
In this embodiment, the matching cost corresponding to the target frame may be determined by matching pixel points between the reference frame and the target frame. After the matching cost is determined, cost aggregation may be performed, the disparity is then determined, and the depth map of the target frame is determined from the correspondence between disparity and depth. Optionally, after the disparity is determined, disparity optimization and refinement may be performed, and the depth map of the target frame is determined from the optimized and refined disparity.
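For a rectified image pair, the disparity-to-depth correspondence mentioned above is depth = focal_length × baseline / disparity. A minimal sketch under that standard assumption (variable names are illustrative):

```python
import numpy as np

def disparity_to_depth(disparity, focal_length, baseline, eps=1e-6):
    """Convert a disparity map (pixels) to a depth map (same unit as baseline)."""
    depth = np.zeros_like(disparity, dtype=np.float64)
    valid = disparity > eps                       # zero disparity corresponds to points at infinity
    depth[valid] = focal_length * baseline / disparity[valid]
    return depth
```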
The flying height of an unmanned aerial vehicle is generally about 100 meters, and the camera generally shoots vertically downwards. Because the ground has varying elevation and reflects sunlight differently, the images captured by the drone exhibit non-negligible illumination changes, which reduce the accuracy of the three-dimensional reconstruction of the target scene.
On the basis of any one of the foregoing embodiments, in order to improve robustness of three-dimensional reconstruction of a target scene to illumination, in the method for three-dimensional reconstruction of a target scene provided in this embodiment, determining a matching cost corresponding to a target frame according to a reference frame may include: determining a first type matching cost and a second type matching cost corresponding to the target frame according to the target frame and the reference frame; and determining that the matching cost corresponding to the target frame is equal to the weighted sum of the first type matching cost and the second type matching cost.
In the embodiment, when the matching cost is calculated, the first type matching cost and the second type matching cost are fused, so that compared with the case that only a single type matching cost is adopted, the robustness of the matching cost on illumination is improved, the influence of illumination change on three-dimensional reconstruction is reduced, and the accuracy of the three-dimensional reconstruction is improved. In this embodiment, the weighting coefficients of the first type matching cost and the second type matching cost may be set according to specific needs, which is not limited in this embodiment.
Optionally, the first type of matching cost may be determined based on Zero-mean Normalized Cross-Correlation (ZNCC), which accurately measures the similarity between the target frame and the reference frame.
In this embodiment, illumination-invariant features, such as Local Binary Patterns (LBP) or census sequences, may be extracted from the image frames acquired by the drone, and the second type of matching cost is then determined based on these illumination-invariant features.
The census sequence in this embodiment may be determined as follows: select any point in the image frame, draw a rectangle (e.g., 3 × 3) centered on that point, and compare each point in the rectangle other than the center point with the center point, recording 1 if its gray value is smaller than that of the center point and 0 if it is greater. The resulting sequence of 0s and 1s, of length 8, is taken as the census sequence of the center point; that is, the gray value of the center pixel is replaced by the census sequence.
After the census transform, the Hamming distance may be used to determine the second type of matching cost between the target frame and the reference frame.
For example, the matching cost corresponding to the target frame may be equal to a weighted sum of the two matching costs ZNCC and census.
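The sketch below illustrates one way such a fused cost could be computed for a single pair of equally sized patches, under assumed window handling and with purely illustrative 0.5/0.5 weights; it is a sketch of the general ZNCC-plus-census idea, not the patented implementation.

```python
import numpy as np

def zncc_cost(patch_a, patch_b, eps=1e-8):
    """ZNCC similarity in [-1, 1] converted to a cost (lower is better)."""
    a = patch_a - patch_a.mean()
    b = patch_b - patch_b.mean()
    zncc = (a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + eps)
    return 1.0 - zncc

def census_transform(patch):
    """Bit string: 1 where a pixel is darker than the centre pixel, 0 otherwise."""
    center = patch[patch.shape[0] // 2, patch.shape[1] // 2]
    return (patch < center).astype(np.uint8).ravel()

def census_cost(patch_a, patch_b):
    """Hamming distance between census bit strings, normalised to [0, 1]."""
    bits_a, bits_b = census_transform(patch_a), census_transform(patch_b)
    return np.count_nonzero(bits_a != bits_b) / bits_a.size

def fused_cost(patch_a, patch_b, w_zncc=0.5, w_census=0.5):
    """Weighted sum of the two matching costs, as described in the text."""
    return w_zncc * zncc_cost(patch_a, patch_b) + w_census * census_cost(patch_a, patch_b)
```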
In some embodiments, one implementation of determining the matching cost corresponding to the target frame according to the reference frame may be: dividing a target frame into a plurality of image blocks; determining the matching cost corresponding to each image block according to the reference frame; and determining the matching cost corresponding to the target frame according to the matching cost corresponding to each image block.
In this embodiment, the target frame may be divided into a plurality of image blocks by one or more of the following ways:
(1) Dividing the target frame into a plurality of image blocks by clustering. In this embodiment, the target frame may, for example, be divided into image blocks by clustering according to the color information and/or texture information of the target frame.
(2) Dividing the target frame evenly into a plurality of image blocks. For example, the number of image blocks may be preset, and the target frame is then divided according to that preset number.
(3) Dividing the target frame into image blocks of a preset size. For example, the size of an image block may be preset, and the target frame is then divided according to that preset size.
Optionally, after the target frame is divided into a plurality of image blocks, the matching cost corresponding to each image block may be determined in parallel according to the reference frame. In this embodiment, software and/or hardware parallelism may be used; for example, the matching cost of each image block may be computed in parallel by multiple threads, and/or by a Graphics Processing Unit (GPU).
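As a simple software-side illustration (one of many possible parallelisation schemes), per-block costs could be computed with a process pool; compute_block_cost() is a hypothetical placeholder for whatever cost routine is used, e.g. the fused ZNCC/census cost sketched above.

```python
from concurrent.futures import ProcessPoolExecutor

def block_costs_parallel(blocks, reference, compute_block_cost, workers=4):
    """Compute the matching cost of each image block against the reference frame in parallel."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(compute_block_cost, block, reference) for block in blocks]
        return [f.result() for f in futures]
```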
According to the method for three-dimensional reconstruction of the target scene, on the basis of the embodiment, the target frame is divided into the plurality of image blocks, the matching cost corresponding to each image block is determined in parallel according to the reference frame, and then the matching cost corresponding to the target frame is determined according to the matching cost corresponding to each image block, so that the calculation speed of the matching cost is increased, and the real-time performance of three-dimensional reconstruction of the target scene is improved.
The number of depth samples can be determined from the depth range and the required accuracy: it is positively correlated with the depth range and negatively correlated with the accuracy value (the finer the accuracy, the more samples). For example, if the depth range is 50 meters and the accuracy requirement is 0.1 meters, the number of depth samples may be 500.
When determining the matching cost of the target frame, a preset number of depth samples may be used. Alternatively, Simultaneous Localization and Mapping (SLAM) may be used to recover some sparse three-dimensional points in the target frame; the depth range of the whole target frame is then determined from these sparse points, and the number of depth samples is determined from the depth range and accuracy requirement of the whole target frame.
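A minimal sketch of deriving the sample count from sparse depths, under the range/accuracy relationship stated above; the 0.1 m accuracy and the optional margin are example values only.

```python
import math

def depth_sample_count(sparse_depths, accuracy=0.1, margin=0.0):
    """Number of depth samples = depth range / required accuracy (rounded up)."""
    d_min, d_max = min(sparse_depths), max(sparse_depths)
    depth_range = (d_max - d_min) + 2 * margin    # optionally pad the recovered range
    return max(1, math.ceil(depth_range / accuracy))

# e.g. a 50 m depth range with 0.1 m accuracy gives 500 samples
```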
On the basis of any of the foregoing embodiments, in order to further increase the processing speed and improve the real-time performance of three-dimensional reconstruction of a target scene, in the method for three-dimensional reconstruction of a target scene provided in this embodiment, determining the matching cost corresponding to each image block according to a reference frame may include: determining the depth sampling times of each image block according to the sparse points in the image block; and determining the matching cost corresponding to each image block according to the reference frame and the depth sampling times of each image block.
It should be noted that, when the drone shoots vertically downwards, the target frame may contain a variety of objects, such as pedestrians, cars, trees and tall buildings, so the depth range of the whole target frame is relatively large, and under a given accuracy requirement the number of depth samples is correspondingly large. The depth range of each individual image block, however, is relatively small; for example, when an image block contains only a pedestrian, its depth range is much smaller than that of the whole target frame, and the number of depth samples can be greatly reduced at the same accuracy. In other words, at the same accuracy requirement, the number of depth samples of an image block is necessarily less than or equal to that of the entire target frame.
This embodiment fully considers the depth range of each image block and sets the number of depth samples per block accordingly, which reduces the computational complexity and increases speed while preserving accuracy.
In this embodiment, for each image block, SLAM may be used to recover some sparse three-dimensional points in the image block; the depth range of the image block is determined from these sparse points, and the number of depth samples for the image block is then determined from the block's depth range and the accuracy requirement.
How the method provided by this embodiment reduces the computational complexity and increases the processing speed is illustrated by the following numerical analysis:
if the target frame is an image frame with a size of 640 × 480 pixels, the depth sampling time is determined to be 500 according to the depth range of the target frame, and then the matching cost needs to be calculated for 640 × 480 × 500 times. If the target frame is uniformly divided into 320 × 160 image blocks, and the depth sampling times of the 6 image blocks determined according to the depth ranges of the respective image blocks are 100, 200, 150, 100, 50, and 300, respectively, only 320 × 160 (100+200+150+100+150+300) matching costs need to be calculated. The calculation amount is only one third of the original amount.
Optionally, after the matching cost corresponding to the target frame is determined, the depth map of the target frame may be determined according to the Semi-Global Matching (SGM) algorithm.
It can be understood that, owing to large variations of the surface normal in the target scene, the inherent discreteness of depth sampling, and weak or repeated textures, the depth map of the target frame inevitably contains a large amount of randomly distributed noise.
On the basis of any of the foregoing embodiments, in order to prevent noise in the depth map from reducing the accuracy of the three-dimensional reconstruction, the method provided by this embodiment may further include, after obtaining the depth map of the target frame based on the reference frame: filtering the depth map of the target frame. Filtering the depth map removes noise and improves the accuracy of the three-dimensional reconstruction.
Optionally, one implementation of filtering the depth map of the target frame is trilateral filtering. Trilateral filtering in this embodiment means that the weighting coefficients used during filtering are determined jointly by three factors: the pixel distance, the depth difference, and the color difference.
For example, in one filtering pass the filter window is 5 × 5; that is, the filtered depth value of a target pixel is determined by the depth values of that pixel and the 24 surrounding pixels. The weight each neighbouring pixel contributes to the target pixel's depth value is determined from the Euclidean distance between the two pixels, the difference between their depth values, and the difference between their RGB values.
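A minimal, unoptimised sketch of such a trilateral filter is given below; the Gaussian weighting and the sigma values are assumptions chosen for illustration, since the disclosure only states that the three factors are combined.

```python
import numpy as np

def trilateral_filter(depth, rgb, radius=2, sigma_d=2.0, sigma_z=0.5, sigma_c=10.0):
    """Filter a depth map using spatial distance, depth difference and colour difference weights."""
    rgb = rgb.astype(np.float64)              # avoid uint8 wrap-around when differencing colours
    out = depth.astype(np.float64).copy()
    h, w = depth.shape
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            num, den = 0.0, 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    w_dist = np.exp(-(dx * dx + dy * dy) / (2 * sigma_d ** 2))                       # pixel distance
                    w_depth = np.exp(-(depth[ny, nx] - depth[y, x]) ** 2 / (2 * sigma_z ** 2))       # depth difference
                    w_color = np.exp(-np.sum((rgb[ny, nx] - rgb[y, x]) ** 2) / (2 * sigma_c ** 2))   # colour difference
                    weight = w_dist * w_depth * w_color
                    num += weight * depth[ny, nx]
                    den += weight
            out[y, x] = num / (den + 1e-12)
    return out
```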
On the basis of the above embodiment, the target scene three-dimensional reconstruction method further performs trilateral filtering on the depth map of the target frame. The sharp, fine edge information in the target frame improves the accuracy of the edges in the depth map, and noise is removed more robustly while the edges are preserved, so that the depth map of the target frame is more accurate and the three-dimensional reconstruction based on it is more accurate.
In some embodiments, one implementation of fusing the depth map of the target frame to obtain the three-dimensional model of the target scene may be: determining a point cloud corresponding to the target frame according to the depth map of the target frame; fusing point clouds corresponding to the target frames into voxels corresponding to the target scene; and obtaining a three-dimensional model of the target scene according to the voxels corresponding to the target scene.
In this embodiment, the point cloud corresponding to the target frame is determined according to the depth map of the target frame, and the corresponding relationship between the depth map and the three-dimensional point cloud in the prior art may be used for conversion, which is not limited in this embodiment.
This embodiment adopts a voxel-based point cloud fusion method. Since the flight route is planned before the drone takes off and the drone shoots vertically downwards, the area covered by the planned route can be represented by voxels of a preset size. After each depth map frame is converted into a point cloud, each point can be located in its corresponding voxel according to its three-dimensional coordinates; the point's normal vector is fused into the voxel's normal vector, its coordinates are fused into the voxel's coordinates, and the voxel stores the visibility information.
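The following sketch illustrates the general idea of hashing points into fixed-size voxels and merging coordinates and normals with running averages; the voxel size, the dictionary-based grid and the averaging rule are assumptions for the example, not details fixed by this disclosure.

```python
import numpy as np

def fuse_points_into_voxels(points, normals, voxel_grid, voxel_size=0.5):
    """Fuse a frame's point cloud (N x 3 points and normals) into a dict-based voxel grid."""
    for p, n in zip(points, normals):
        key = tuple((p // voxel_size).astype(int))        # which voxel the point falls into
        if key not in voxel_grid:
            voxel_grid[key] = {"coord": p.copy(), "normal": n.copy(), "count": 1}
        else:
            v = voxel_grid[key]
            v["count"] += 1
            v["coord"] += (p - v["coord"]) / v["count"]    # running mean of coordinates
            v["normal"] += (n - v["normal"]) / v["count"]  # running mean of normals
    return voxel_grid
```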
The target scene three-dimensional reconstruction method of this embodiment offers good real-time performance and scalability. The depth map is fused almost pixel by pixel in parallel, and adding a depth value to the point cloud, i.e., fusing a point into a voxel, has computational complexity O(1), so fusion runs in real time. For a single planning task, the particular structure of the route plan allows the target area to be partitioned into several sub-blocks, so the point cloud is well partitioned, which facilitates point cloud loading and subsequent display at multiple Levels of Detail (LOD), and supports real-time three-dimensional reconstruction of large scenes.
Fig. 4 is a schematic block diagram of an embodiment of the target scene three-dimensional reconstruction method provided by the present invention. As shown in fig. 4, the method of this embodiment may be implemented with two threads: a densification thread and a fusion thread. The densification thread performs initialization, frame selection, depth map calculation, and depth map filtering. For example, a new key frame and its position and orientation may be acquired by the camera mounted on the drone, and initialization is completed based on the acquired key frame and its position and orientation. After initialization, frame selection is performed; the selected frames in this embodiment include a target frame and a reference frame, and the specific implementation may follow the target frame and reference frame determination methods in any of the embodiments above, which are not repeated here. The depth map calculation in this embodiment may, for example, combine a plane-sweeping algorithm with Semi-Global Matching (SGM) optimization, or follow the implementation of determining the depth map from the matching cost in any of the embodiments above, which is not repeated here. The filtering in this embodiment may, for example, be trilateral filtering, whose specific implementation may refer to the embodiments above and is not repeated here. After the densification thread finishes processing, the fusion thread of this embodiment fuses the depth maps according to the RGB image, the position and orientation, and the depth map queue, and builds a three-dimensional point cloud from the fused depth maps. The densification thread and the fusion thread in this embodiment can run in parallel, which increases the speed and improves the real-time performance of the three-dimensional reconstruction of the target scene.
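An illustrative sketch of the two-thread organisation described for fig. 4 is shown below: a densification thread produces depth maps and a fusion thread consumes them through a queue. The function names compute_depth_map and fuse_depth_map are hypothetical stand-ins for the processing stages described above.

```python
import queue
import threading

depth_queue = queue.Queue(maxsize=8)   # depth map queue between the two threads

def densification_worker(keyframes, compute_depth_map):
    for frame in keyframes:
        depth_queue.put(compute_depth_map(frame))   # frame selection + cost + filtering
    depth_queue.put(None)                           # sentinel: no more frames

def fusion_worker(fuse_depth_map):
    while True:
        depth = depth_queue.get()
        if depth is None:
            break
        fuse_depth_map(depth)                       # merge into the voxel model / point cloud

def run_pipeline(keyframes, compute_depth_map, fuse_depth_map):
    t1 = threading.Thread(target=densification_worker, args=(keyframes, compute_depth_map))
    t2 = threading.Thread(target=fusion_worker, args=(fuse_depth_map,))
    t1.start(); t2.start()
    t1.join(); t2.join()
```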
Fig. 5 is a schematic structural diagram of an embodiment of the target scene three-dimensional reconstruction system provided by the present invention. As shown in fig. 5, the target scene three-dimensional reconstruction system 500 provided by this embodiment may include: a processor 501 and a memory 502. The processor 501 and the memory 502 are communicatively connected by a bus. The processor 501 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor, or any conventional processor. The memory 502 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), or an Electrically Erasable Programmable Read-Only Memory (EEPROM).
A memory 502 for storing program code; the processor 501 calls program code, which when executed, performs the following:
acquiring an image sequence of a target scene, wherein the image sequence comprises a plurality of image frames which are continuous in time sequence;
obtaining a target frame and a reference frame according to a plurality of image frames which are continuous in time sequence, and obtaining a depth map of the target frame based on the reference frame;
and fusing the depth map of the target frame to obtain a three-dimensional model of the target scene.
Alternatively, the target frame includes one frame of a plurality of image frames that are consecutive in time sequence.
Optionally, the reference frame comprises a frame having overlapping pixels with the target frame.
Optionally, the reference frame includes at least a first image frame and a second image frame; the first image frame temporally precedes the target frame; the second image frame is chronologically subsequent to the target frame.
Optionally, if the target frame is the nth frame, the first image frame is the (N-1) th frame, and the second image frame is the (N + 1) th frame.
Optionally, the reference frame includes at least a third image frame; the third image frame is not parallel to the epipolar direction of the target frame.
Optionally, the processor 501 is configured to obtain a depth map of the target frame based on the reference frame, and specifically may include:
and obtaining a depth map of the target frame according to the parallax between the target frame and the reference frame.
Optionally, the processor 501 is configured to obtain a depth map of the target frame based on the reference frame, and specifically may include:
determining the matching cost corresponding to the target frame according to the reference frame;
and determining the depth map of the target frame according to the matching cost corresponding to the target frame.
Optionally, the processor 501 is configured to determine a matching cost corresponding to the target frame according to the reference frame, and specifically may include:
determining a first type matching cost and a second type matching cost corresponding to the target frame according to the target frame and the reference frame;
and determining that the matching cost corresponding to the target frame is equal to the weighted sum of the first type matching cost and the second type matching cost.
Optionally, the first type of matching cost is determined based on zero-mean normalized cross-correlation.
Optionally, the second type matching cost is determined based on the illumination invariant feature.
Optionally, the processor 501 is configured to determine a matching cost corresponding to the target frame according to the reference frame, and specifically may include:
dividing a target frame into a plurality of image blocks;
determining the matching cost corresponding to each image block according to the reference frame;
and determining the matching cost corresponding to the target frame according to the matching cost corresponding to each image block.
Optionally, the processor 501 is configured to divide the target frame into a plurality of image blocks, and specifically may include:
and dividing the target frame into a plurality of image blocks by adopting a clustering mode.
Optionally, the processor 501 is configured to divide the target frame into a plurality of image blocks, and specifically may include:
the target frame is evenly divided into a plurality of image blocks.
Optionally, the processor 501 is configured to determine a matching cost corresponding to each image block according to the reference frame, and specifically may include:
and according to the reference frame, determining the matching cost corresponding to each image block in parallel.
Optionally, the processor 501 is configured to determine a matching cost corresponding to each image block according to the reference frame, and specifically may include:
determining the depth sampling times of each image block according to the sparse points in the image block;
and determining the matching cost corresponding to each image block according to the reference frame and the depth sampling times of each image block.
Optionally, the processor 501 is further configured to perform a filtering process on the depth map of the target frame after obtaining the depth map of the target frame based on the reference frame.
Optionally, the processor 501 is configured to perform filtering processing on the depth map of the target frame, and specifically may include:
and carrying out trilateral filtering processing on the depth map of the target frame.
Optionally, the processor 501 is configured to fuse the depth map of the target frame to obtain a three-dimensional model of the target scene, and specifically may include:
determining a point cloud corresponding to the target frame according to the depth map of the target frame;
fusing point clouds corresponding to the target frames into voxels corresponding to the target scene;
and obtaining a three-dimensional model of the target scene according to the voxels corresponding to the target scene.
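For example, the depth-map-to-model pipeline may be sketched as below: the depth map is back-projected into a world-space point cloud with a pinhole model, and the points are accumulated into a sparse voxel grid. The intrinsics, pose and voxel size are assumptions, and a production system might instead use a truncated signed-distance fusion:

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy, cam_to_world):
    """Back-project a depth map into a world-space point cloud (pinhole model).

    `cam_to_world` is a 4x4 homogeneous camera pose; zero-depth pixels are dropped.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    valid = z > 0
    x = (u.reshape(-1) - cx) / fx * z
    y = (v.reshape(-1) - cy) / fy * z
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)[valid]
    return (cam_to_world @ pts_cam.T).T[:, :3]

def fuse_into_voxels(points, voxel_size=0.2):
    """Minimal occupancy-style fusion: accumulate point counts into a sparse voxel dict."""
    voxels = {}
    idx = np.floor(points / voxel_size).astype(np.int64)
    for key in map(tuple, idx):
        voxels[key] = voxels.get(key, 0) + 1
    return voxels
```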
Fig. 6 is a schematic structural diagram of an embodiment of the unmanned aerial vehicle provided by the present invention. As shown in Fig. 6, the drone 600 provided by this embodiment may include a processor 601. The processor 601 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The unmanned aerial vehicle 600 is mounted with an imaging device 602, and the imaging device 602 is used to image a target scene.
The processor 601 is configured to obtain an image sequence of a target scene, where the image sequence includes a plurality of image frames that are consecutive in time sequence;
obtaining a target frame and a reference frame according to a plurality of image frames which are continuous in time sequence, and obtaining a depth map of the target frame based on the reference frame;
and fusing the depth map of the target frame to obtain a three-dimensional model of the target scene.
Optionally, the target frame includes one frame of the plurality of image frames that are consecutive in time sequence.
Optionally, the reference frame comprises a frame having overlapping pixels with the target frame.
Optionally, the reference frame includes at least a first image frame and a second image frame; the first image frame temporally precedes the target frame; the second image frame is chronologically subsequent to the target frame.
Optionally, if the target frame is the Nth frame, the first image frame is the (N-1)th frame, and the second image frame is the (N+1)th frame.
Optionally, the reference frame includes at least a third image frame; the third image frame is not parallel to the epipolar direction of the target frame.
Optionally, the processor 601 is configured to obtain a depth map of the target frame based on the reference frame, and specifically may include:
and obtaining a depth map of the target frame according to the parallax between the target frame and the reference frame.
Optionally, the processor 601 is configured to obtain a depth map of the target frame based on the reference frame, and specifically may include:
determining the matching cost corresponding to the target frame according to the reference frame;
and determining the depth map of the target frame according to the matching cost corresponding to the target frame.
Optionally, the processor 601 is configured to determine a matching cost corresponding to the target frame according to the reference frame, and specifically may include:
determining a first type matching cost and a second type matching cost corresponding to the target frame according to the target frame and the reference frame;
and determining that the matching cost corresponding to the target frame is equal to the weighted sum of the first type matching cost and the second type matching cost.
Optionally, the first type of matching cost is determined based on zero-mean normalized cross-correlation.
Optionally, the second type matching cost is determined based on the illumination invariant feature.
Optionally, the processor 601 is configured to determine a matching cost corresponding to the target frame according to the reference frame, and specifically may include:
dividing a target frame into a plurality of image blocks;
determining the matching cost corresponding to each image block according to the reference frame;
and determining the matching cost corresponding to the target frame according to the matching cost corresponding to each image block.
Optionally, the processor 601 is configured to divide the target frame into a plurality of image blocks, and specifically may include:
and dividing the target frame into a plurality of image blocks by adopting a clustering mode.
Optionally, the processor 601 is configured to divide the target frame into a plurality of image blocks, and specifically may include:
the target frame is evenly divided into a plurality of image blocks.
Optionally, the processor 601 is configured to determine a matching cost corresponding to each image block according to the reference frame, and specifically may include:
and according to the reference frame, determining the matching cost corresponding to each image block in parallel.
Optionally, the processor 601 is configured to determine a matching cost corresponding to each image block according to the reference frame, and specifically may include:
determining the number of depth samples for each image block according to the sparse points in the image block;
and determining the matching cost corresponding to each image block according to the reference frame and the number of depth samples for each image block.
Optionally, the processor 601 is further configured to perform a filtering process on the depth map of the target frame after obtaining the depth map of the target frame based on the reference frame.
Optionally, the processor 601 is configured to perform filtering processing on the depth map of the target frame, and specifically may include:
and carrying out trilateral filtering processing on the depth map of the target frame.
Optionally, the processor 601 is configured to fuse the depth map of the target frame to obtain a three-dimensional model of the target scene, and specifically may include:
determining a point cloud corresponding to the target frame according to the depth map of the target frame;
fusing point clouds corresponding to the target frames into voxels corresponding to the target scene;
and obtaining a three-dimensional model of the target scene according to the voxels corresponding to the target scene.
An embodiment of the present invention further provides a device (e.g., a chip, an integrated circuit, etc.) for three-dimensional reconstruction of a target scene, including: a memory and a processor. The memory is used for storing codes for executing the target scene three-dimensional reconstruction method. The processor is configured to call the code stored in the memory, and execute the method for three-dimensional reconstruction of a target scene according to any one of the method embodiments.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, where the computer program includes at least one code, and the at least one code is executable by a computer to control the computer to perform the method for three-dimensional reconstruction of a target scene according to any of the above method embodiments.
An embodiment of the present invention provides a computer program, which is used to implement the method for three-dimensional reconstruction of a target scene according to any one of the above method embodiments when the computer program is executed by a computer.
Those of ordinary skill in the art will understand that all or part of the steps for implementing the above method embodiments may be implemented by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments; and the aforementioned storage medium includes various media capable of storing program code, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (57)

  1. A method for three-dimensional reconstruction of a target scene, comprising:
    acquiring an image sequence of a target scene, wherein the image sequence comprises a plurality of image frames which are continuous in time sequence;
    obtaining a target frame and a reference frame according to a plurality of image frames which are continuous in the time sequence, and obtaining a depth map of the target frame based on the reference frame;
    and fusing the depth map of the target frame to obtain a three-dimensional model of the target scene.
  2. The method of claim 1, wherein the target frame comprises one of a plurality of image frames that are temporally consecutive.
  3. The method of claim 2, wherein the reference frame comprises a frame having overlapping pixels with the target frame.
  4. The method of claim 1, wherein the reference frame comprises at least a first image frame and a second image frame;
    the first image frame temporally precedes the target frame;
    the second image frame is temporally subsequent to the target frame.
  5. The method of claim 4, wherein if the target frame is the Nth frame, the first image frame is the (N-1)th frame, and the second image frame is the (N+1)th frame.
  6. The method of claim 1, wherein the reference frame comprises at least a third image frame;
    the third image frame is not parallel to an epipolar direction of the target frame.
  7. The method according to any of claims 1-6, wherein said obtaining the depth map of the target frame based on the reference frame comprises:
    and obtaining a depth map of the target frame according to the parallax between the target frame and the reference frame.
  8. The method of claim 1, wherein the obtaining the depth map of the target frame based on the reference frame comprises:
    determining a matching cost corresponding to the target frame according to the reference frame;
    and determining the depth map of the target frame according to the matching cost corresponding to the target frame.
  9. The method of claim 8, wherein the determining the matching cost corresponding to the target frame according to the reference frame comprises:
    determining a first type matching cost and a second type matching cost corresponding to the target frame according to the target frame and the reference frame;
    and determining that the matching cost corresponding to the target frame is equal to the weighted sum of the first type matching cost and the second type matching cost.
  10. The method of claim 9, wherein the first type matching cost is determined based on a zero-mean normalized cross-correlation.
  11. The method of claim 9, wherein the second type matching cost is determined based on illumination invariant features.
  12. The method of claim 8, wherein the determining the matching cost corresponding to the target frame according to the reference frame comprises:
    dividing the target frame into a plurality of image blocks;
    determining the matching cost corresponding to each image block according to the reference frame;
    and determining the matching cost corresponding to the target frame according to the matching cost corresponding to each image block.
  13. The method of claim 12, wherein the dividing the target frame into a plurality of tiles comprises:
    and dividing the target frame into a plurality of image blocks in a clustering mode.
  14. The method of claim 12, wherein the dividing the target frame into a plurality of tiles comprises:
    the target frame is evenly divided into a plurality of image blocks.
  15. The method according to claim 12, wherein the determining the matching cost corresponding to each image block according to the reference frame comprises:
    and according to the reference frame, determining the matching cost corresponding to each image block in parallel.
  16. The method according to claim 12, wherein the determining the matching cost corresponding to each image block according to the reference frame comprises:
    determining the number of depth samples for each image block according to the sparse points in the image block;
    and determining the matching cost corresponding to each image block according to the reference frame and the number of depth samples for each image block.
  17. The method of claim 1, further comprising, after the obtaining the depth map of the target frame based on the reference frame:
    and carrying out filtering processing on the depth map of the target frame.
  18. The method of claim 17, wherein the filtering the depth map of the target frame comprises:
    and carrying out trilateral filtering processing on the depth map of the target frame.
  19. The method of claim 1, wherein said fusing the depth map of the target frame to obtain the three-dimensional model of the target scene comprises:
    determining a point cloud corresponding to the target frame according to the depth map of the target frame;
    fusing the point cloud corresponding to the target frame into the voxel corresponding to the target scene;
    and obtaining a three-dimensional model of the target scene according to the voxel corresponding to the target scene.
  20. A system for three-dimensional reconstruction of an object scene, comprising: a processor and a memory;
    the memory for storing program code;
    the processor, invoking the program code, when executed, is configured to:
    acquiring an image sequence of a target scene, wherein the image sequence comprises a plurality of image frames which are continuous in time sequence;
    obtaining a target frame and a reference frame according to a plurality of image frames which are continuous in the time sequence, and obtaining a depth map of the target frame based on the reference frame;
    and fusing the depth map of the target frame to obtain a three-dimensional model of the target scene.
  21. The system of claim 20, wherein the target frame comprises one of a plurality of image frames that are temporally consecutive.
  22. The system of claim 21, wherein the reference frame comprises a frame having overlapping pixels with the target frame.
  23. The system of claim 20, wherein the reference frame comprises at least a first image frame and a second image frame;
    the first image frame temporally precedes the target frame;
    the second image frame is temporally subsequent to the target frame.
  24. The system of claim 23, wherein if the target frame is the Nth frame, the first image frame is the (N-1)th frame, and the second image frame is the (N+1)th frame.
  25. The system of claim 20, wherein the reference frame comprises at least a third image frame;
    the third image frame is not parallel to an epipolar direction of the target frame.
  26. The system according to any of claims 20-25, wherein the processor is configured to obtain the depth map of the target frame based on the reference frame, and specifically comprises:
    and obtaining a depth map of the target frame according to the parallax between the target frame and the reference frame.
  27. The system according to claim 20, wherein the processor is configured to obtain the depth map of the target frame based on the reference frame, and specifically comprises:
    determining a matching cost corresponding to the target frame according to the reference frame;
    and determining the depth map of the target frame according to the matching cost corresponding to the target frame.
  28. The system according to claim 27, wherein the processor is configured to determine the matching cost corresponding to the target frame according to the reference frame, and specifically includes:
    determining a first type matching cost and a second type matching cost corresponding to the target frame according to the target frame and the reference frame;
    and determining that the matching cost corresponding to the target frame is equal to the weighted sum of the first type matching cost and the second type matching cost.
  29. The system of claim 28, wherein the first type matching cost is determined based on a zero-mean normalized cross-correlation.
  30. The system of claim 28, wherein the second type matching cost is determined based on illumination invariant features.
  31. The system according to claim 27, wherein the processor is configured to determine the matching cost corresponding to the target frame according to the reference frame, and specifically includes:
    dividing the target frame into a plurality of image blocks;
    determining the matching cost corresponding to each image block according to the reference frame;
    and determining the matching cost corresponding to the target frame according to the matching cost corresponding to each image block.
  32. The system according to claim 31, wherein the processor is configured to divide the target frame into a plurality of image blocks, and specifically comprises:
    and dividing the target frame into a plurality of image blocks in a clustering mode.
  33. The system according to claim 31, wherein the processor is configured to divide the target frame into a plurality of image blocks, and specifically comprises:
    the target frame is evenly divided into a plurality of image blocks.
  34. The system according to claim 31, wherein the processor is configured to determine a matching cost corresponding to each image block according to the reference frame, and specifically includes:
    and according to the reference frame, determining the matching cost corresponding to each image block in parallel.
  35. The system according to claim 31, wherein the processor is configured to determine a matching cost corresponding to each image block according to the reference frame, and specifically includes:
    determining the number of depth samples for each image block according to the sparse points in the image block;
    and determining the matching cost corresponding to each image block according to the reference frame and the number of depth samples for each image block.
  36. The system of claim 20, wherein the processor is further configured to perform a filtering process on the depth map of the target frame after the obtaining the depth map of the target frame based on the reference frame.
  37. The system according to claim 36, wherein the processor is configured to perform filtering processing on the depth map of the target frame, and specifically includes:
    and carrying out trilateral filtering processing on the depth map of the target frame.
  38. The system according to claim 20, wherein the processor is configured to fuse the depth maps of the target frames to obtain a three-dimensional model of the target scene, and specifically includes:
    determining a point cloud corresponding to the target frame according to the depth map of the target frame;
    fusing the point cloud corresponding to the target frame into the voxel corresponding to the target scene;
    and obtaining a three-dimensional model of the target scene according to the voxel corresponding to the target scene.
  39. An unmanned aerial vehicle, comprising: a processor;
    the unmanned aerial vehicle is provided with a shooting device, and the shooting device is used for shooting a target scene;
    the processor is configured to:
    acquiring an image sequence of a target scene, wherein the image sequence comprises a plurality of image frames which are continuous in time sequence;
    obtaining a target frame and a reference frame according to a plurality of image frames which are continuous in the time sequence, and obtaining a depth map of the target frame based on the reference frame;
    and fusing the depth map of the target frame to obtain a three-dimensional model of the target scene.
  40. The drone of claim 39, wherein the target frame comprises one of a plurality of image frames that are consecutive in time sequence.
  41. The drone of claim 40, wherein the reference frame comprises a frame having overlapping pixels with the target frame.
  42. A drone as claimed in claim 39, wherein the reference frames include at least a first image frame and a second image frame;
    the first image frame temporally precedes the target frame;
    the second image frame is temporally subsequent to the target frame.
  43. A drone as claimed in claim 42, wherein if the target frame is the Nth frame, the first image frame is the (N-1)th frame, and the second image frame is the (N+1)th frame.
  44. A drone as claimed in claim 39, wherein the reference frame includes at least a third image frame;
    the third image frame is not parallel to an epipolar direction of the target frame.
  45. A drone as claimed in any of claims 39-44, wherein the processor is configured to obtain the depth map of the target frame based on the reference frame, and in particular to:
    and obtaining a depth map of the target frame according to the parallax between the target frame and the reference frame.
  46. The drone of claim 39, wherein the processor is configured to obtain the depth map of the target frame based on the reference frame, and in particular comprises:
    determining a matching cost corresponding to the target frame according to the reference frame;
    and determining the depth map of the target frame according to the matching cost corresponding to the target frame.
  47. The unmanned aerial vehicle of claim 46, wherein the processor is configured to determine a matching cost corresponding to the target frame according to the reference frame, and specifically includes:
    determining a first type matching cost and a second type matching cost corresponding to the target frame according to the target frame and the reference frame;
    and determining that the matching cost corresponding to the target frame is equal to the weighted sum of the first type matching cost and the second type matching cost.
  48. A drone as in claim 47, wherein the first type matching cost is determined based on zero-mean normalized cross-correlation.
  49. A drone in accordance with claim 47, wherein the second type of matching cost is determined based on illumination invariant features.
  50. The unmanned aerial vehicle of claim 46, wherein the processor is configured to determine a matching cost corresponding to the target frame according to the reference frame, and specifically includes:
    dividing the target frame into a plurality of image blocks;
    determining the matching cost corresponding to each image block according to the reference frame;
    and determining the matching cost corresponding to the target frame according to the matching cost corresponding to each image block.
  51. A drone as claimed in claim 50, wherein the processor is configured to divide the target frame into a plurality of tiles, including:
    and dividing the target frame into a plurality of image blocks in a clustering mode.
  52. A drone as claimed in claim 50, wherein the processor is configured to divide the target frame into a plurality of tiles, including:
    the target frame is evenly divided into a plurality of image blocks.
  53. The unmanned aerial vehicle of claim 50, wherein the processor is configured to determine a matching cost corresponding to each image block according to the reference frame, and specifically includes:
    and according to the reference frame, determining the matching cost corresponding to each image block in parallel.
  54. The unmanned aerial vehicle of claim 50, wherein the processor is configured to determine a matching cost corresponding to each image block according to the reference frame, and specifically includes:
    determining the number of depth samples for each image block according to the sparse points in the image block;
    and determining the matching cost corresponding to each image block according to the reference frame and the number of depth samples for each image block.
  55. The drone of claim 39, wherein the processor is further configured to filter the depth map of the target frame after the obtaining the depth map of the target frame based on the reference frame.
  56. The drone of claim 55, wherein the processor is configured to filter the depth map of the target frame, and specifically includes:
    and carrying out trilateral filtering processing on the depth map of the target frame.
  57. The drone of claim 39, wherein the processor is configured to fuse the depth map of the target frame to obtain a three-dimensional model of the target scene, and specifically includes:
    determining a point cloud corresponding to the target frame according to the depth map of the target frame;
    fusing the point cloud corresponding to the target frame into the voxel corresponding to the target scene;
    and obtaining a three-dimensional model of the target scene according to the voxel corresponding to the target scene.
CN201880073770.5A 2018-12-04 2018-12-04 Target scene three-dimensional reconstruction method and system and unmanned aerial vehicle Pending CN111433819A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/119153 WO2020113417A1 (en) 2018-12-04 2018-12-04 Three-dimensional reconstruction method and system for target scene, and unmanned aerial vehicle

Publications (1)

Publication Number Publication Date
CN111433819A true CN111433819A (en) 2020-07-17

Family

ID=70974848

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880073770.5A Pending CN111433819A (en) 2018-12-04 2018-12-04 Target scene three-dimensional reconstruction method and system and unmanned aerial vehicle

Country Status (2)

Country Link
CN (1) CN111433819A (en)
WO (1) WO2020113417A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113253293B (en) * 2021-06-03 2021-09-21 中国人民解放军国防科技大学 Method for eliminating laser point cloud distortion and computer readable storage medium
CN117916684A (en) * 2021-12-24 2024-04-19 深圳市大疆创新科技有限公司 Mobile control method and device for movable platform and movable platform

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102665086A (en) * 2012-04-26 2012-09-12 清华大学深圳研究生院 Method for obtaining parallax by using region-based local stereo matching
CN103260032A (en) * 2013-04-18 2013-08-21 清华大学深圳研究生院 Method for improving frame rate of stereoscopic video depth map sequence
CN108537876A (en) * 2018-03-05 2018-09-14 清华-伯克利深圳学院筹备办公室 Three-dimensional rebuilding method, device, equipment based on depth camera and storage medium
CN108701373A (en) * 2017-11-07 2018-10-23 深圳市大疆创新科技有限公司 Three-dimensional rebuilding method, system based on unmanned plane and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3293705B1 (en) * 2016-09-12 2022-11-16 Dassault Systèmes 3d reconstruction of a real object from a depth map
CN106814744A (en) * 2017-03-14 2017-06-09 吉林化工学院 A kind of UAV Flight Control System and method
CN108521788B (en) * 2017-11-07 2022-02-25 深圳市大疆创新科技有限公司 Method for generating simulated flight path, method and equipment for simulating flight and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102665086A (en) * 2012-04-26 2012-09-12 清华大学深圳研究生院 Method for obtaining parallax by using region-based local stereo matching
CN103260032A (en) * 2013-04-18 2013-08-21 清华大学深圳研究生院 Method for improving frame rate of stereoscopic video depth map sequence
CN108701373A (en) * 2017-11-07 2018-10-23 深圳市大疆创新科技有限公司 Three-dimensional rebuilding method, system based on unmanned plane and device
CN108537876A (en) * 2018-03-05 2018-09-14 清华-伯克利深圳学院筹备办公室 Three-dimensional rebuilding method, device, equipment based on depth camera and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TIAN Hu (田虎): "Depth Estimation from Monocular Images", China Doctoral Dissertations Full-text Database, Information Science and Technology, 2016 *

Also Published As

Publication number Publication date
WO2020113417A1 (en) 2020-06-11

Similar Documents

Publication Publication Date Title
CN111433818A (en) Target scene three-dimensional reconstruction method and system and unmanned aerial vehicle
JP7252943B2 (en) Object detection and avoidance for aircraft
US20210141378A1 (en) Imaging method and device, and unmanned aerial vehicle
CN111436216B (en) Method and system for color point cloud generation
WO2020172875A1 (en) Method for extracting road structure information, unmanned aerial vehicle, and automatic driving system
CN111527463A (en) Method and system for multi-target tracking
CN108140245B (en) Distance measurement method and device and unmanned aerial vehicle
CN114637023A (en) System and method for laser depth map sampling
CN108235815B (en) Imaging control device, imaging system, moving object, imaging control method, and medium
CN109154815B (en) Maximum temperature point tracking method and device and unmanned aerial vehicle
CN113240813B (en) Three-dimensional point cloud information determining method and device
US20190258255A1 (en) Control device, imaging system, movable object, control method, and program
US20210014427A1 (en) Control device, imaging device, mobile object, control method and program
CN113228103A (en) Target tracking method, device, unmanned aerial vehicle, system and readable storage medium
CN116359873A (en) Method, device, processor and storage medium for realizing SLAM processing of vehicle-end 4D millimeter wave radar by combining fisheye camera
CN111712687B (en) Aerial survey method, aircraft and storage medium
CN111433819A (en) Target scene three-dimensional reconstruction method and system and unmanned aerial vehicle
JP7501535B2 (en) Information processing device, information processing method, and information processing program
CN112955712A (en) Target tracking method, device and storage medium
US20210256732A1 (en) Image processing method and unmanned aerial vehicle
JP6481228B1 (en) Determination device, control device, imaging system, flying object, determination method, and program
WO2021035746A1 (en) Image processing method and device, and movable platform
JPWO2018185940A1 (en) Imaging control apparatus, imaging apparatus, imaging system, moving object, imaging control method, and program
CN112154485A (en) Optimization method and equipment of three-dimensional reconstruction model and movable platform
JP2019220836A (en) Control device, moving object, control method, and program

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200717