CN112446845A - Map construction method, map construction device, SLAM system, and storage medium - Google Patents

Map construction method, map construction device, SLAM system, and storage medium

Info

Publication number
CN112446845A
Authority
CN
China
Prior art keywords: frame, fusion, result, camera, fusion frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011368165.6A
Other languages
Chinese (zh)
Inventor
陈俊宏
王维翰
姜军
柳伟
何震宇
田第鸿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Peng Cheng Laboratory
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Peng Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology and Peng Cheng Laboratory
Priority to CN202011368165.6A
Publication of CN112446845A
Legal status: Pending


Classifications

    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T2207/10004 Still image; Photographic image
    • G06T2207/20221 Image fusion; Image merging
    • G06T2207/30244 Camera pose

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a map construction method applied to a SLAM system comprising N cameras, the method comprising the following steps: acquiring target images of a target area through the N cameras, wherein N is a positive integer greater than or equal to 2; obtaining a fusion frame according to the target images, wherein the target images of the N cameras at the same moment correspond to one fusion frame; judging whether the fusion frame meets a first preset condition; when the fusion frame meets the first preset condition, adding the fusion frame to a key frame queue; and constructing a map of the target area by using the key frame queue. The invention also discloses a map construction device, a SLAM system and a storage medium. When the method is used to construct the map of the target area, the amount of data to be processed is smaller and the map construction efficiency is higher.

Description

Map construction method, map construction device, SLAM system, and storage medium
Technical Field
The present invention relates to the field of map construction, and in particular, to a map construction method, apparatus, SLAM system, and storage medium.
Background
A SLAM system is used for real-time positioning and map construction, and owing to its small size and low cost it is widely applied to lightweight platforms such as AR/MR glasses and quad-rotor unmanned aerial vehicles. The SLAM system comprises a monocular camera, a stereoscopic vision camera or a depth color camera; it acquires a plurality of target images of a target area at different moments through the monocular camera, tracks and matches the plurality of target images respectively, and completes local mapping processing.
However, when the existing method is used to build a map from the target images of a monocular camera, the map construction efficiency is low.
Disclosure of Invention
The invention mainly aims to provide a map construction method, a map construction device, a SLAM system and a storage medium, so as to solve the technical problem in the prior art that the map construction efficiency is low when a map is constructed from the target images of a monocular camera by the existing method.
In order to achieve the above object, the present invention provides a map construction method applied to a SLAM system including N cameras, the method including the steps of:
acquiring a target image of a target area through the N cameras, wherein N is a positive integer greater than or equal to 2;
acquiring a fusion frame according to the target image, wherein the target images of the N cameras at the same moment correspond to one fusion frame;
judging whether the fusion frame meets a first preset condition or not;
when the fused frame meets the first preset condition, adding the fused frame to a key frame queue;
and constructing a map of the target area by using the key frame queue.
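By way of illustration only, the following Python sketch shows how this claimed flow could be wired together; the callables (capture, build_fusion_frame, meets_first_condition, build_map) are hypothetical placeholders for the steps recited above and are not part of the disclosure.

```python
from typing import Callable, List, Sequence

def mapping_pipeline(
    capture: Callable[[int, int], object],            # (camera_index, moment) -> target image
    build_fusion_frame: Callable[[Sequence[object]], object],
    meets_first_condition: Callable[[object, List[object]], bool],
    build_map: Callable[[List[object]], object],
    num_cameras: int,
    num_moments: int,
):
    """Sketch of the claimed flow: capture -> fuse -> filter -> queue -> map."""
    assert num_cameras >= 2                           # N is a positive integer >= 2
    key_frame_queue: List[object] = []
    for t in range(num_moments):
        # one target image per camera at the same moment
        images = [capture(i, t) for i in range(num_cameras)]
        # the N target images at the same moment correspond to one fusion frame
        frame = build_fusion_frame(images)
        # only fusion frames meeting the first preset condition become key frames
        if meets_first_condition(frame, key_frame_queue):
            key_frame_queue.append(frame)
    # the map is built from the key frame queue, not from all fusion frames
    return build_map(key_frame_queue)
```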
Optionally, the fused frame includes matching feature points that satisfy a preset matching condition and monocular feature points that do not satisfy the preset matching condition; before the step of judging whether the fused frame meets the first preset condition, the method further includes:
determining the monocular feature points in the fusion frame, which satisfy the preset matching condition with the matching feature points in the fusion frame at the previous moment, as new matching feature points;
obtaining a result fusion frame according to the newly added matching feature points and the fusion frame;
the step of judging whether the fusion frame meets a first preset condition comprises the following steps:
judging whether the result fusion frame meets a first preset condition or not;
the step of adding the fused frame to a key frame queue when the fused frame meets the first preset condition comprises:
and when the result fusion frame meets the first preset condition, adding the result fusion frame to a key frame queue.
Optionally, the first preset condition includes one necessary constraint and a one-out-of-three constraint, wherein:
the necessary constraint is that the total number of the matching feature points of the result fusion frame is not less than a first preset threshold;
the one-out-of-three constraint is that the number of result fusion frames between the result fusion frame and the previous result fusion frame satisfying the first preset condition exceeds a first preset frame number, or,
the number of result fusion frames between the result fusion frame and the previous result fusion frame satisfying the first preset condition exceeds a second preset frame number and the number of result fusion frames satisfying a second preset condition in the key frame queue is 0, or,
the number of result fusion frames satisfying the second preset condition in the key frame queue does not exceed a second preset threshold;
the first preset frame number is greater than the second preset frame number.
Optionally, before the step of determining whether the result fusion frame is a key frame, the method further includes:
determining a reference camera from the N cameras, and determining a coordinate system corresponding to the reference camera as a world coordinate system;
acquiring internal parameters of the N cameras;
obtaining world coordinates and pixel coordinates of all matched feature points in the result fusion frame;
obtaining the estimated poses of the result fusion frame and the initial moment result fusion frame by using a uniform motion model hypothesis method;
performing iterative optimization on the estimated pose according to the world coordinate system, the internal parameters, the world coordinates and the pixel coordinates through a formula I to obtain a result pose of the result fusion frame;
the step of judging whether the result fusion frame meets a first preset condition comprises the following steps:
judging whether the result fusion frame with the result pose meets a first preset condition or not;
the first formula is as follows:
Figure BDA0002804783600000031
Figure BDA0002804783600000032
wherein i is the ith camera of the N cameras, j is the jth feature point of the all matched feature points included in the result fusion frame, k is the kth moment corresponding to the result fusion frame,
Figure BDA0002804783600000033
h is a re-projection function, P is the re-projection error of the jth characteristic point at the kth moment in a camera coordinate system corresponding to the ith camerajIs the world coordinate of the j-th feature point,
Figure BDA0002804783600000034
is the kth characteristic point at the kth momenti depth coordinates in the camera coordinate system to which the camera corresponds,
Figure BDA0002804783600000035
is the abscissa of the jth characteristic point at the kth moment in the camera coordinate system corresponding to the ith camera,
Figure BDA0002804783600000036
is the ordinate, f, of the kth characteristic point in the camera coordinate system corresponding to the ith camera at the kth momentx、fyTwo intrinsic parameters of the ith camera respectively,
Figure BDA0002804783600000037
for the estimated rotation matrix of the reference camera at the kth time in the estimated pose,
Figure BDA0002804783600000038
for the estimated translation vector of the reference camera at the kth time in the estimated pose,
Figure BDA0002804783600000039
matrix obtained for SE3 transformation of said estimated rotation matrix and said estimated translation vector, I3In the form of a third-order identity matrix,
Figure BDA00028047836000000310
is the translation vector of the world coordinate system to the camera coordinate system corresponding to the ith camera,
Figure BDA00028047836000000311
and the rotation matrix is from the world coordinate system to a camera coordinate system corresponding to the ith camera.
Optionally, the step of constructing a map of the target area by using the keyframe queue includes:
obtaining a local key frame set according to the result fusion frame in the key frame queue;
obtaining a local matching feature point set through the local key frame set;
updating the result pose of the result fusion frame and the world coordinates of the matching feature points included in the result fusion frame through a formula II according to the local key frame set and the local matching feature point set to obtain an updated result pose and updated world coordinates;
obtaining a selected fused frame based on the result fused frame with the updated result pose and the updated world coordinates;
constructing a map of the target area by using the selected fusion frame;
the second formula is:
Figure BDA0002804783600000041
Figure BDA0002804783600000042
where ρ is the kernel function, QijIs the covariance matrix of the j point in the camera coordinate system corresponding to the i camera, BLFor the local key frame set, BFTo include the matching feature point and not in the BLThe results in (1) are fused with a set of frames, N is the total number of cameras,
Figure BDA0002804783600000043
and the result fusion frame comprises a set consisting of all the matched feature points.
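For illustration, the sketch below evaluates an objective of the form of formula two, assuming 2D reprojection residuals, 2x2 covariance matrices Q, and a Huber kernel as one common choice of the kernel function ρ; none of these specific choices are prescribed by the patent.

```python
import numpy as np

def huber(s: float, delta: float = 5.991) -> float:
    """A common robust kernel rho(s) applied to the squared Mahalanobis error s."""
    return s if s <= delta else 2.0 * float(np.sqrt(delta * s)) - delta

def local_ba_objective(residuals, covariances) -> float:
    """Sum of rho(e^T Q^{-1} e) over all observations (frames k, cameras i, points j).

    residuals   : iterable of 2-vectors e_ij^k
    covariances : iterable of 2x2 matrices Q_ij
    """
    total = 0.0
    for e, Q in zip(residuals, covariances):
        e = np.asarray(e, dtype=float)
        s = float(e @ np.linalg.solve(Q, e))   # squared Mahalanobis distance
        total += huber(s)
    return total

# usage: evaluate the objective for two toy observations
errs = [np.array([0.5, -0.2]), np.array([3.0, 1.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
print(local_ba_objective(errs, covs))
```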
Optionally, after the step of constructing the map of the target area by using the selected fusion frame, the method further includes:
screening candidate key frames from all historical selected fusion frames at the moment before the selected fusion frame;
SIM3 transformation is carried out on the selected fusion frame and the candidate key frame to obtain a similarity transformation matrix;
utilizing the similarity transformation matrix to adjust the selected fusion frame and the historical selected fusion frame to obtain an adjusted selected fusion frame and an adjusted historical selected fusion frame;
correcting the result pose of the adjusted selected fusion frame and the world coordinates of the matching feature points included in the adjusted selected fusion frame, as well as the result pose of the adjusted historical selected fusion frame and the world coordinates of the matching feature points included in the adjusted historical selected fusion frame, to obtain a corrected fusion frame pose, corrected fusion frame world coordinates, a corrected historical fusion frame pose and corrected historical fusion frame world coordinates;
and correcting the map of the target area by using the corrected fusion frame pose, the corrected fusion frame world coordinate, the corrected historical fusion frame pose and the corrected historical fusion frame world coordinate to obtain a result map of the target area.
Optionally, the step of screening candidate key frames from all historical selected fusion frames at a time before the selected fusion frame includes:
calculating bag-of-word scores of the selected fusion frame and all historical selected fusion frames corresponding to the selected fusion frame at the previous moment;
screening the primary selection fusion frames meeting a third preset condition from all the historical selection fusion frames;
screening, from the primary selection fusion frames, the selected key frames whose bag-of-word scores exceed a preset bag-of-word score threshold;
calculating the maximum value of the number of common words of the selected key frame and the selected fusion frame;
and screening candidate key frames with the common word number larger than a preset word number threshold value from the selected key frames.
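A hedged sketch of this screening is given below; the bag-of-word scoring, the third preset condition and the rule for deriving the word-number threshold from the maximum common-word count are supplied by the caller, since the patent does not fix them at this point.

```python
def screen_loop_candidates(current, history, bow_score, common_words,
                           passes_third_condition, min_bow_score,
                           word_threshold_from_max):
    """Filter historical selected fusion frames down to loop-closure candidates.

    current  : the current selected fusion frame
    history  : all historical selected fusion frames at earlier moments
    bow_score, common_words, passes_third_condition, word_threshold_from_max :
               caller-supplied callables standing in for unspecified details
    """
    # 1. keep history frames that satisfy the (unspecified) third preset condition
    primary = [f for f in history if passes_third_condition(current, f)]
    # 2. keep frames whose bag-of-word score with the current frame is high enough
    selected = [f for f in primary if bow_score(current, f) > min_bow_score]
    if not selected:
        return []
    # 3. the word-number threshold is derived from the maximum common-word count
    max_common = max(common_words(current, f) for f in selected)
    min_words = word_threshold_from_max(max_common)   # e.g. a fraction of max_common
    # 4. final candidates share enough common words with the current frame
    return [f for f in selected if common_words(current, f) > min_words]
```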
Further, to achieve the above object, the present invention also proposes a map construction apparatus applied to a SLAM system including N cameras, the apparatus including:
the acquisition module is used for acquiring a target image of a target area through the N cameras, wherein N is a positive integer greater than or equal to 2;
an obtaining module, configured to obtain a fusion frame according to the target image, where the target images of the N cameras at the same time correspond to one fusion frame;
the judging module is used for judging whether the fusion frame meets a first preset condition or not;
the adding module is used for adding the fusion frame to a key frame queue when the fusion frame meets the first preset condition;
and the construction module is used for constructing a map of the target area by utilizing the key frame queue.
In addition, to achieve the above object, the present invention also provides a SLAM system, including: n cameras, a memory, a processor and a map building program stored on the memory and running on the processor, the map building program when executed by the processor implementing the steps of the map building method as claimed in any one of the above; wherein N is a positive integer greater than or equal to 2.
Furthermore, to achieve the above object, the present invention also proposes a storage medium having stored thereon a map construction program which, when executed by a processor, implements the steps of the map construction method according to any one of the above.
The technical scheme of the invention adopts a map construction method applied to a SLAM system comprising N cameras, the method comprising the following steps: acquiring target images of a target area through the N cameras, wherein N is a positive integer greater than or equal to 2; obtaining a fusion frame according to the target images, wherein the target images of the N cameras at the same moment correspond to one fusion frame; judging whether the fusion frame meets a first preset condition; when the fusion frame meets the first preset condition, adding the fusion frame to a key frame queue; and constructing a map of the target area by using the key frame queue. Because only the fusion frames meeting the first preset condition are added to the key frame queue when the map of the target area is constructed, and the map of the target area is constructed by using the key frame queue rather than all the fusion frames, the amount of data to be processed is smaller and the map construction efficiency is higher.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the structures shown in the drawings without creative efforts.
Fig. 1 is a schematic structural diagram of a SLAM system of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a first exemplary embodiment of a map construction method according to the present invention;
FIG. 3 is a schematic diagram of a hash table according to the present invention;
FIG. 4 is a schematic illustration of a triangulation process;
FIG. 5 is a diagram illustrating a structure of a fused frame;
FIG. 6 is a block diagram showing the construction of a first embodiment of the map building apparatus according to the present invention;
FIG. 7 is a comparison of the accuracy of the method of the present invention with other mapping methods.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, a schematic diagram of a SLAM system structure of a hardware operating environment according to an embodiment of the present invention is shown.
Generally, a SLAM system includes: a camera 307, at least one processor 301, a memory 302, and a fused frame obtaining program stored on the memory and executable on the processor, the fused frame obtaining program configured to implement the steps of the fused frame obtaining method as previously described.
The processor 301 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 301 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 301 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 301 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. The processor 301 may further include an AI (Artificial Intelligence) processor for processing operations related to the fusion frame acquisition method, so that the fusion frame acquisition method model can be trained and learned autonomously, improving efficiency and accuracy.
Memory 302 may include one or more computer-readable storage media, which may be non-transitory. Memory 302 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in memory 302 is used to store at least one instruction for execution by processor 301 to implement the fused frame acquisition method provided by the method embodiments herein.
The cameras 307 include N cameras, which may have a rotation angle and a translation distance therebetween, for acquiring a target image of a target area. Generally, the N cameras simultaneously acquire target images of a target area at the same time, and the multiple target images are different due to the rotation angle and the distance between the multiple cameras. The N cameras may be high-definition cameras or ordinary cameras, and the cameras in the N cameras may be monocular cameras, stereoscopic cameras, and depth color cameras.
In some embodiments, the SLAM system may further optionally include: a communication interface 303 and at least one peripheral device. The processor 301, the memory 302 and the communication interface 303 may be connected by a bus or signal lines. Various peripheral devices may be connected to communication interface 303 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 304, a display screen 305, and a power source 306.
The communication interface 303 may be used to connect at least one peripheral device related to I/O (Input/Output) to the processor 301 and the memory 302. In some embodiments, processor 301, memory 302, and communication interface 303 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 301, the memory 302 and the communication interface 303 may be implemented on a single chip or circuit board, which is not limited in this embodiment.
The Radio Frequency circuit 304 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 304 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 304 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 304 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuitry 304 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 304 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 305 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 305 is a touch display screen, the display screen 305 also has the ability to capture touch signals on or over the surface of the display screen 305. The touch signal may be input to the processor 301 as a control signal for processing. At this point, the display screen 305 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display screen 305 may be one, the front panel of the electronic device; in other embodiments, the display screens 305 may be at least two, respectively disposed on different surfaces of the electronic device or in a folded design; in still other embodiments, the display screen 305 may be a flexible display screen disposed on a curved surface or a folded surface of the electronic device. Even further, the display screen 305 may be arranged in a non-rectangular irregular figure, i.e. a shaped screen. The Display screen 305 may be made of LCD (liquid crystal Display), OLED (Organic Light-Emitting Diode), and the like.
The power supply 306 is used to power various components in the electronic device. The power source 306 may be alternating current, direct current, disposable or rechargeable. When the power source 306 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology. Those skilled in the art will appreciate that the architecture shown in fig. 1 does not constitute a limitation of the fused frame obtaining apparatus and may include more or less components than those shown, or combine some components, or a different arrangement of components.
Furthermore, an embodiment of the present invention further provides a storage medium on which a fused frame obtaining program is stored; when executed by a processor, the fused frame obtaining program implements the steps of the fused frame obtaining method described above, so a detailed description thereof is omitted here, and the beneficial effects of the same method are likewise not repeated. For technical details not disclosed in the embodiments of the computer-readable storage medium referred to in the present application, reference is made to the description of the method embodiments of the present application. As an example, the program instructions may be deployed to be executed on one SLAM system, or on multiple SLAM systems located at one site, or distributed across multiple sites and interconnected by a communication network.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
Based on the hardware structure, the embodiment of the map construction method is provided.
Referring to fig. 2, fig. 2 is a flowchart of a first embodiment of a mapping method according to the present invention; the method is applied to a SLAM system and comprises the following steps:
step S11: and acquiring a target image of a target area through the N cameras, wherein N is a positive integer greater than or equal to 2.
It should be noted that the target area may refer to an area through which the SLAM system passes during operation, and feature information included in the target area may be captured by multiple cameras of the SLAM system. The target region may be an unknown region or a known region, and the present invention is not limited thereto.
The target images are obtained by N cameras at different moments, one camera corresponds to one target image at the same moment, and all the target images at the same moment are a group of target images; for example, the N cameras are 4 cameras, the different times are 3 different times, the target images acquired by the N cameras are 12 target images, and the 4 target images at the same time are a group of target images.
In a specific application, in order to ensure that the composition of the target area is complete and of good quality, the different moments are preferably chosen as consecutive moments within a time period; a user may also make other settings according to actual requirements, which is not limited by the present invention.
Step S12: and acquiring a fusion frame according to the target image, wherein the target images of the N cameras at the same moment correspond to one fusion frame.
A fused frame is obtained by using a group of target images at the same time, for example, 4 cameras are used as N cameras, 3 different times are used as different times, 12 target images are obtained by the N cameras, 4 target images at the same time are used as a group of target images, and 3 fused frames are obtained by using 4 target images belonging to the same time in 3 different times.
It is to be understood that the method for obtaining the fusion frame may be any method based on the feature points of the target image, and the method is not limited thereto, and preferably, the fusion frame is obtained by using the following method.
The first step is as follows: and performing feature extraction on the N target images at the same moment to obtain a feature point set.
In image processing, a feature point refers to a point where the image gray value changes drastically or a point with large curvature on an image edge. Features may be classified into color features and texture features. A color feature is a global feature that describes the surface properties of the scene corresponding to the image or image region; a general color feature is based on pixel-level characteristics, so all pixels belonging to the image or image region contribute to it. Since color is insensitive to changes in the orientation, size, etc. of an image or image region, color features do not capture local features of objects in an image well; in addition, when only color features are used for retrieval, if the database is large, many irrelevant images are often returned. The color histogram is the most commonly used way of expressing color features; its advantages are that it is not affected by image rotation and translation and, with normalization, is also unaffected by image scale changes, while its basic disadvantage is that it does not express the spatial distribution of colors. A texture feature is also a global feature that likewise describes the surface properties of the scene corresponding to the image or image region.
In a specific application, the feature points may be extracted by any one of the ORB, SURF or SIFT feature extraction algorithms, or by other methods, which is not limited in the present invention.
Wherein the feature point includes a camera ID, a descriptor, and pixel coordinates (2D coordinates) of the feature point; wherein the descriptor may be a vector describing information of pixels around the feature point, and the vector may be a binary number. The camera ID refers to the number of the feature point in the target image, and since the same feature point is extracted from N target images possibly at the same moment, the same feature point can simultaneously correspond to N camera IDs at the same moment; the descriptor refers to the descriptor of the feature point in the target image, the descriptors of the same feature point in different target images at the same time may be different, the pixel coordinate refers to the pixel coordinate of the feature point in the target image, and the pixel coordinate of the same feature point in different target images at the same time may be different.
It is understood that, for the sake of convenience, the target points in the following description refer to the object points corresponding to the feature points and objectively existing in the target area.
For example, the SLAM system has 4 cameras: camera a, camera b, camera c and camera d. At the same moment, the 4 cameras capture 4 target images of the target area, feature extraction is performed on the 4 target images, and feature point A corresponding to the target area is extracted. In the target image of camera a, the camera ID of feature point A is 12, its descriptor is 1010100 and its pixel coordinates are (13, 11); in the target image of camera b, the camera ID is 22, the descriptor is 1010110 and the pixel coordinates are (23, 11); in the target image of camera c, the camera ID is 32, the descriptor is 1010111 and the pixel coordinates are (33, 11); in the target image of camera d, the camera ID is 42, the descriptor is 1010101 and the pixel coordinates are (43, 11). At this moment, feature point A corresponds to 4 descriptors, 4 camera IDs and 4 pixel coordinates.
And obtaining a feature point set comprising feature points and feature information of the feature points according to the extracted feature points, wherein the feature information of the feature points is the camera ID, the pixel coordinates and the descriptor.
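As a purely illustrative sketch, ORB features from OpenCV could be used to build such a per-camera feature point set; the dictionary layout below mirrors the camera ID, descriptor and pixel-coordinate fields described above and is an assumption, not the patent's required implementation.

```python
import cv2  # OpenCV, used here only as an example feature extractor

def extract_feature_point_set(target_images):
    """Extract ORB features from the N target images captured at the same moment.

    Returns, per camera, a list of feature records that can be assembled into the
    feature point set described above.
    """
    orb = cv2.ORB_create(nfeatures=1000)
    feature_point_set = []
    for camera_index, image in enumerate(target_images):
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) if image.ndim == 3 else image
        keypoints, descriptors = orb.detectAndCompute(gray, None)
        points = []
        if descriptors is not None:
            for kp, desc in zip(keypoints, descriptors):
                points.append({
                    "camera_index": camera_index,   # which camera's target image
                    "pixel_coordinate": kp.pt,      # 2D coordinates in that image
                    "descriptor": desc,             # 32-byte binary ORB descriptor
                })
        feature_point_set.append(points)
    return feature_point_set
```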
The second step is that: and obtaining a hash table based on the feature point set obtained in the first step.
Specifically, the second step includes: determining matched feature points meeting preset matching conditions in the feature point set; determining monocular feature points which do not meet the preset matching condition in the feature point set; and obtaining a hash table based on the matching feature points and the monocular feature points.
It should be noted that, when the descriptors of a plurality of feature points are close, those feature points satisfy the preset matching condition. The required degree of closeness is not limited in the present invention and may be set by the user; however, in order to ensure high accuracy of the obtained matching feature points, it is better to set a strict similarity requirement on the descriptors in the preset matching condition. For example, if the descriptor is a 7-bit binary number, the preset matching condition may be that the first five bits of the descriptors are identical. The preset matching condition referred to below is the same preset matching condition.
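To make the notion of close descriptors concrete, the following sketch uses the Hamming distance between binary descriptors; the threshold value is an arbitrary assumption playing the same role as the first-five-bits example above.

```python
import numpy as np

def descriptors_match(desc_a, desc_b, max_distance: int = 40) -> bool:
    """Preset matching condition sketch: binary descriptors whose Hamming
    distance does not exceed max_distance are treated as the same target point."""
    bits_a = np.unpackbits(np.asarray(desc_a, dtype=np.uint8))
    bits_b = np.unpackbits(np.asarray(desc_b, dtype=np.uint8))
    hamming = int(np.count_nonzero(bits_a != bits_b))
    return hamming <= max_distance

# usage with a toy 32-byte descriptor
a = np.random.randint(0, 256, 32, dtype=np.uint8)
print(descriptors_match(a, a))   # identical descriptors always match
```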
Further, in the second step, the step of obtaining a hash table based on the matching feature point and the monocular feature point includes: acquiring a first descriptor, a first pixel coordinate and a first camera ID of the matched feature point in the N target images; acquiring a second descriptor, a second pixel coordinate and a second camera ID of the monocular feature point in the N target images; obtaining a hash table based on the matching feature point, the monocular feature point, the first descriptor, the first pixel coordinate, the first camera ID, the second descriptor, the second pixel coordinate, and the second camera ID.
After the feature points are extracted, the feature points extracted from the target images corresponding to different cameras at the same moment are taken as one set, that is, N cameras correspond to N sets. All the feature points in the N sets are matched, and the feature points satisfying the preset matching condition are combined into one matching feature point; the matching feature point corresponds to a plurality of camera IDs, a plurality of descriptors and a plurality of pixel coordinates, which serve as its feature information. Not every camera's target image necessarily yields the matching feature point; it only needs to be extracted from the target images of at least two cameras, so the number of these values is not more than N. The feature points in the N sets that do not satisfy the preset matching condition are taken as monocular feature points, and the feature information of a monocular feature point is the single camera ID, descriptor and pixel coordinate corresponding to that feature point.
For example, the SLAM system has 4 cameras: camera a, camera b, camera c and camera d. At the same moment, the 4 cameras capture 4 target images of the target area, feature extraction is performed on the 4 target images, and feature point A corresponding to a target point of the target area is extracted. In the target image of camera a, the camera ID of feature point A is 12, its descriptor is 1010100 and its pixel coordinates are (13, 11); in the target image of camera b, the camera ID is 22, the descriptor is 1010110 and the pixel coordinates are (23, 11); in the target image of camera c, the camera ID is 32, the descriptor is 1010111 and the pixel coordinates are (33, 11); in the target image of camera d, the camera ID is 42, the descriptor is 1010101 and the pixel coordinates are (43, 11). In this case, feature point A is a matching feature point with 4 corresponding descriptors, 4 camera IDs and 4 pixel coordinates, and the feature information of matching feature point A includes the four camera IDs (12, 22, 32 and 42), the 4 descriptors (1010100, 1010110, 1010111 and 1010101) and the 4 pixel coordinates ((13, 11), (23, 11), (33, 11) and (43, 11)).
Referring to fig. 3, fig. 3 is a schematic diagram of the hash table of the present invention. All the matching feature points and monocular feature points are respectively assigned the values 0 to M-1, that is, the total number of matching feature points and monocular feature points is M; these values are the keys (id) of the hash table, one matching feature point or one monocular feature point corresponds to one key, and the feature information of the matching feature points and monocular feature points serves as the values of the hash table. Since the SLAM system comprises N cameras, the value of a key in the hash table comprises N pieces of feature information, namely the feature information of the feature point corresponding to a target point in each of the N target images of the N cameras; a value of the hash table is a camera ID with the descriptor and the pixel coordinate attached after the corresponding camera ID as sub-feature information. The value of the hash table may be a camera ID with the descriptor and the pixel coordinate attached, or may be information obtained by combining the descriptor, the pixel coordinate and the camera ID, and the present invention is not limited thereto.
It can be understood that, when the first value of a key (i.e., of a feature point corresponding to a target point, which may be a monocular feature point or a matching feature point) is extracted from the target image of a certain camera, the corresponding camera ID may be determined according to the position of that value. In addition, when a feature point cannot be extracted from a target image, the camera ID of that feature point for the camera corresponding to that target image is assigned -1, and no descriptor or pixel coordinates are attached; when the camera ID is assigned -1, i.e., the feature point is not extracted from the target image of that camera, the assignment may also take other forms, such as -2, and the present invention is not limited to -1.
In fig. 3, the key 104 corresponds to N pieces of feature information; from left to right, no feature information is extracted from the target image of camera No. 1, feature information is extracted from the target image of camera i with a corresponding camera ID of 201, and feature information is extracted from the target image of camera N with a corresponding camera ID of 3.
It can be understood that the form of the hash table provided by the present invention is a preferred choice; other forms of hash table can be determined according to the idea of the present invention, which will not be described in detail herein.
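A minimal sketch of the hash table layout described above, assuming Python dictionaries and illustrative field names.

```python
# Keys 0..M-1 index the M feature points (matching + monocular); each value holds
# N slots, one per camera, with camera ID -1 meaning "not extracted by that camera".
def build_hash_table(per_point_observations, num_cameras):
    """per_point_observations: list (length M) of dicts mapping camera index ->
    (camera_id, descriptor, pixel_coordinate) for the cameras that saw the point."""
    table = {}
    for key, obs in enumerate(per_point_observations):
        slots = []
        for cam in range(num_cameras):
            if cam in obs:
                camera_id, descriptor, pixel = obs[cam]
                slots.append({"camera_id": camera_id,
                              "descriptor": descriptor,
                              "pixel_coordinate": pixel})
            else:
                slots.append({"camera_id": -1})     # feature not seen by this camera
        table[key] = slots
    return table

# usage: feature point 0 seen by cameras 0 and 2 of a 3-camera rig
example = [{0: (12, "1010100", (13, 11)), 2: (32, "1010111", (33, 11))}]
print(build_hash_table(example, num_cameras=3)[0])
```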
The third step: and obtaining a map point set based on the hash table obtained in the second step.
Specifically, the third step includes: based on the hash table, obtaining the selected pixel coordinates corresponding to the matching feature points; obtaining selected camera coordinates of the matched feature points based on the selected pixel coordinates of the matched feature points; obtaining depth information of the matched feature points based on the selected camera coordinates of the matched feature points; obtaining world coordinates of the matched feature points based on the depth information of the matched feature points; and obtaining the map point set based on the world coordinates of the matched feature points and the matched feature points.
It should be noted that, with reference to the structure of the hash table, the feature information of the matching feature point corresponding to a key (i.e., a feature point) is looked up through that key, each camera to which a piece of that feature information belongs is determined as a selected camera, and the camera coordinate system of the selected camera is a selected camera coordinate system. It can be understood that, when a certain value Q of a key P in the hash table is -1, the camera to which the value Q of key P belongs is not a selected camera. The pixel coordinates of the matching feature point in the target image of a selected camera are the selected pixel coordinates, and the selected pixel coordinates are used to obtain the corresponding normalized selected camera coordinates. The depth information of the matching feature point is obtained based on its selected camera coordinates; the world coordinates of the matching feature point are obtained based on its depth information; and the map point set is obtained based on the world coordinates of the matching feature points and the matching feature points.
The depth information refers to the depth of the matching feature point in the camera coordinate system to which the selected camera belongs. It can be understood that the world coordinates of a matching feature point can be obtained from only one selected camera coordinate and one corresponding piece of depth information; that is, for each matching feature point only one piece of depth information is obtained, and it is not necessary to obtain the depth information under all the selected camera coordinates of the matching feature point.
For example, the number of cameras is 4 and feature point A is extracted from the target images of 3 of the 4 cameras, i.e., feature point A is a matching feature point. In the obtained hash table, the key of matching feature point A is 15 and there are 4 corresponding values, one of which is -1, meaning that matching feature point A is not extracted from the target image of the camera to which the -1 value belongs. Then the cameras to which the other 3 values belong are the selected cameras, the pixel coordinates of matching feature point A for each selected camera are the selected pixel coordinates, and the camera coordinates corresponding to the selected pixel coordinates are the selected camera coordinates. When the world coordinates are obtained by using the 3 selected camera coordinates, only the depth information of any one selected camera coordinate (such as L) needs to be obtained, and the world coordinates of feature point A can then be obtained by using L and its depth information.
In addition, the corresponding selected camera coordinates are obtained through the formula three and the selected pixel coordinates.
The third formula is:
$$x_c=\frac{u-c_x}{f_x},\qquad y_c=\frac{v-c_y}{f_y}$$
wherein u is the abscissa in the selected pixel coordinate, v is the ordinate in the selected pixel coordinate, $x_c$ is the abscissa of the selected camera coordinate, $y_c$ is the ordinate of the selected camera coordinate, and $f_x$, $f_y$, $c_x$, $c_y$ are all intrinsic parameters of the selected camera; it is understood that the intrinsic parameters of the different cameras corresponding to the same feature point may be different. The coordinates obtained according to formula three include only the abscissa $x_c$ and the ordinate $y_c$, so the camera coordinates at this point are normalized camera coordinates, expressed as $(x_c, y_c, 1)^{\top}$.
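A small sketch of formula three, assuming standard pinhole intrinsics; the numeric values in the usage line are placeholders.

```python
def pixel_to_normalized_camera(u: float, v: float, fx: float, fy: float,
                               cx: float, cy: float):
    """Formula three: selected pixel coordinate -> normalized selected camera
    coordinate (x, y, 1) with unit depth."""
    x = (u - cx) / fx
    y = (v - cy) / fy
    return (x, y, 1.0)

# usage with placeholder intrinsics
print(pixel_to_normalized_camera(320.0, 250.0, fx=600.0, fy=600.0, cx=320.0, cy=240.0))
```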
Further, the step of obtaining depth information of the matching feature points based on the selected camera coordinates of the matching feature points comprises: determining two result camera coordinates in the selected camera coordinates of the matched feature points; and obtaining the depth information of the matched feature points by a triangulation method and a least square method according to the two result camera coordinates.
It can be understood that a matching feature point corresponds to at least two pieces of feature information, that is, in the hash table shown in fig. 3, the matching feature point corresponds to at least two values other than -1. When a matching feature point corresponds to more than two values that are not -1 in the hash table, the feature information corresponding to any two of those values is determined as the result feature information, the camera coordinates obtained from the pixel coordinates in the result feature information are taken as the result camera coordinates, and the cameras to which the result camera coordinates of the matching feature point belong are taken as the result cameras.
The triangulation method has the following formula, i.e., formula four:
$$s_2X_2=s_1RX_1+t$$
wherein $X_1$ is one of the two result camera coordinates, $X_2$ is the other of the two result camera coordinates, $s_1$ is the depth information of $X_1$ in the corresponding camera coordinate system, $s_2$ is the depth information of $X_2$ in the corresponding camera coordinate system, R is the rotation matrix between the camera coordinate system corresponding to $X_1$ and the camera coordinate system corresponding to $X_2$, and t is the translation vector between the camera coordinate system corresponding to $X_1$ and the camera coordinate system corresponding to $X_2$. It will be appreciated that, as described above, the camera coordinates X are the normalized coordinates $(x_c, y_c, 1)^{\top}$.
Referring to fig. 4, fig. 4 is a schematic diagram of the triangulation method, where P is the target point corresponding to a feature point in the target images of the SLAM system, i.e. the actual location of the feature point in world coordinates, $I_1$ is the target image of one result camera, $p_1$ is the pixel coordinate of P in $I_1$, $I_2$ is the target image of the other result camera, and $p_2$ is the pixel coordinate of P in $I_2$.
In a specific application, the depth information of either of the two result camera coordinates is obtained by using formula four and the least square method, and the result camera coordinate corresponding to that depth information is the target camera coordinate; for example, when $s_1$ in formula four is taken as the depth information, the corresponding $X_1$ is taken as the target camera coordinate. The original camera coordinates corresponding to the target camera coordinates are then restored by using the depth information, i.e., the abscissa, ordinate and depth of the normalized target camera coordinates are respectively multiplied by the depth information, where the original camera coordinates are coordinates in the camera coordinate system to which the target camera coordinates belong. For example, $X_1$ is a result camera coordinate of feature point A for camera No. 1; with the depth information $s_1$ of $X_1$ as the selected depth information, the original camera coordinates (X, Y, Z) of $X_1$ are obtained, which are the coordinates in the camera coordinate system to which camera No. 1 belongs.
Further, the step of obtaining the world coordinates of the matching feature points based on the depth information of the matching feature points includes: and obtaining the world coordinates of the matched feature points based on the depth information of the matched feature points, the target camera coordinates corresponding to the depth information, the rotation matrix of the target camera coordinate system corresponding to the target camera coordinates and the world coordinate system, and the translation vector of the target camera coordinate system and the world coordinate system.
It can be understood that only one piece of depth information is required for each matching feature point, and the original camera coordinates corresponding to the feature point are obtained from that depth information and the target camera coordinates corresponding to it; the camera coordinate system to which the original camera coordinates belong is the target camera coordinate system. For example, $X_1$ is one of the result camera coordinates of feature point A for camera No. 1; with the depth information $s_1$ of $X_1$ as the selected depth information (at this point the result camera coordinate $X_1$ is the target camera coordinate), the original camera coordinates (X, Y, Z) of $X_1$ are obtained, which are coordinates in the camera coordinate system to which camera No. 1 belongs, and the camera coordinate system to which camera No. 1 belongs is the target camera coordinate system. Then, according to the rotation matrix between the target camera coordinate system and the world coordinate system and the translation vector between the target camera coordinate system and the world coordinate system, the world coordinates of the original camera coordinates are obtained, and these world coordinates are the world coordinates of the matching feature point.
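The depth recovery (formula four with least squares) and the camera-to-world step can be sketched together as follows; the conventions that R and t map points from the X1 camera frame to the X2 camera frame, and that R_cw and t_cw map world coordinates into the target camera frame, are assumptions made only for illustration.

```python
import numpy as np

def triangulate_depths(X1, X2, R, t):
    """Solve s2*X2 = s1*R*X1 + t for the depths (s1, s2) in the least-squares sense.

    X1, X2 : normalized camera coordinates (x, y, 1) in the two result cameras
    R, t   : rotation and translation from the X1 camera frame to the X2 camera frame
    """
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    A = np.column_stack((R @ X1, -X2))          # 3x2 system: s1*(R X1) - s2*X2 = -t
    depths, *_ = np.linalg.lstsq(A, -np.asarray(t, dtype=float), rcond=None)
    return depths                                # (s1, s2)

def normalized_to_world(X_norm, depth, R_cw, t_cw):
    """Scale the normalized target camera coordinate by its depth, then move it
    from the target camera coordinate system to the world coordinate system."""
    P_cam = depth * np.asarray(X_norm, dtype=float)         # original camera coordinates
    return np.linalg.inv(R_cw) @ (P_cam - t_cw)             # world coordinates

# usage with a toy two-camera setup (second camera shifted along x)
R = np.eye(3)
t = np.array([-0.1, 0.0, 0.0])
P_world = np.array([0.5, 0.2, 4.0])
X1 = P_world / P_world[2]
X2 = (P_world + t) / (P_world + t)[2]
s1, s2 = triangulate_depths(X1, X2, R, t)
print(normalized_to_world(X1, s1, np.eye(3), np.zeros(3)))  # approx. [0.5, 0.2, 4.0]
```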
In a specific application, the leftmost camera of the N cameras in the SLAM system may be the first camera, and the coordinate system of the initial time of the first camera may be the world coordinate system, or the world coordinate system may be determined in other manners, which is not limited in the present invention.
When the world coordinates of the matching feature points are obtained by using the hash table, key values of the hash table are used as map point elements of a map point set, the world coordinates of the matching feature points corresponding to the key values are used as the world coordinates of the key values, the world coordinates corresponding to the monocular feature points are assigned to be 0, and the map point set is obtained according to the key values and the world coordinates.
Step S15: and obtaining a description subset based on the hash table obtained in the second step.
Further, step S15 includes: based on the hash table, obtaining a first selected descriptor corresponding to the matching feature point and a second selected descriptor corresponding to the monocular feature point; determining a first result descriptor in the first selected descriptor when the matching feature point is extracted for the first time; determining a second result descriptor when the monocular feature points are extracted for the first time in the second selected descriptor; and obtaining the description subset according to the matching feature points, the monocular feature points, the first result descriptor and the second result descriptor.
It should be noted that the key of a matching feature point corresponds to a plurality of pieces of feature information, while the key of a monocular feature point corresponds to one piece of feature information; each value comprises the feature information (descriptor, camera ID and pixel coordinates) of the feature point under a different camera. For a matching feature point, the valid feature information (i.e., feature information whose camera ID is not -1) is determined as the selected feature information, and among the selected feature information, the descriptor in the target image in which the matching feature point was first extracted is determined as the first result descriptor. For a monocular feature point, its only descriptor is determined as the second result descriptor; it can be understood that this descriptor is necessarily the descriptor corresponding to the first extraction of the feature point.
The method includes the steps that a key value in a hash table is used as an element unit of a description subset, a corresponding first result descriptor or a corresponding second result descriptor is used as a selected descriptor of the key value, and the description subset is obtained according to the key value, the first result descriptor and the second result descriptor.
Step S16: and obtaining a fusion frame based on the hash table obtained in the second step, the map point set obtained in the third step and the description subset obtained in the fourth step.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a fused frame. The fused frame is obtained based on the target images of the N cameras at a certain moment; the shapes of the frames (i.e., the target images) corresponding to the cameras differ because the poses (angles and displacements) of the cameras differ. The virtual fused frame is the fusion frame and comprises a map point set, a hash table and a description subset, wherein the total number of matching feature points and monocular feature points is M, the number of map points in the map point set is M, and the number of descriptors in the description subset is also M. Illustratively, the map point set comprises 5 map points (P1, P2, P3, P4 and P5) and the feature points comprise 9 points, among which there are both matching feature points and monocular feature points.
When fusion frames at different moments are obtained, determining monocular feature points, meeting the preset matching conditions, of the matching feature points in the fusion frames and the fusion frames at the previous moment as newly added matching feature points; and updating the feature points of the fusion frame by using the newly added matching feature points to obtain a result fusion frame, wherein all the matching feature points of the result fusion frame are the sum of the matching feature points of the fusion frame and the newly added matching feature points. The preset matching condition refers to the description of the preset matching condition.
For example, at the k-th moment the fused frame includes 350 matching feature points and 250 monocular feature points, and at the (k-1)-th moment there are 330 matching feature points and 270 monocular feature points. The 250 monocular feature points at the k-th moment are matched against the 330 matching feature points at the (k-1)-th moment; if 30 points are matched, 30 newly added matching feature points are obtained, and the result fused frame at the k-th moment includes 380 matching feature points.
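A sketch of this update, assuming the frames store their feature points as binary descriptors and reusing a Hamming-distance test as a stand-in for the preset matching condition.

```python
import numpy as np

def hamming(a, b) -> int:
    a = np.unpackbits(np.asarray(a, dtype=np.uint8))
    b = np.unpackbits(np.asarray(b, dtype=np.uint8))
    return int(np.count_nonzero(a != b))

def update_fusion_frame(current_monocular, current_matching, previous_matching,
                        max_distance: int = 40):
    """Promote monocular feature points of the current fusion frame that match a
    matching feature point of the previous-moment fusion frame, and return the
    matching feature points of the result fusion frame."""
    newly_added = []
    still_monocular = []
    for desc in current_monocular:
        if any(hamming(desc, prev) <= max_distance for prev in previous_matching):
            newly_added.append(desc)            # satisfies the preset matching condition
        else:
            still_monocular.append(desc)
    # all matching feature points of the result fusion frame = old ones + newly added
    result_matching = list(current_matching) + newly_added
    return result_matching, still_monocular
```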
Step S13: and judging whether the fusion frame meets a first preset condition or not.
It should be noted that the first preset condition includes a necessary constraint and a one-out-of-three constraint; the necessary constraint is that the total number of the matched feature points of the result fusion frame is not less than a first preset threshold; the third-out-of-one constraint comprises that the number of result fusion frames between the result fusion frame and a previous result fusion frame meeting the first preset condition exceeds a first preset frame number, or that the number of result fusion frames between the result fusion frame and a previous result fusion frame meeting the first preset condition exceeds a second preset frame number and the number of result fusion frames meeting the second preset condition in the key frame queue is 0, or that the number of result fusion frames meeting the second preset condition in the key frame queue does not exceed a second preset threshold; the first preset frame number is greater than the second preset frame number.
It can be understood that the necessary condition of the first preset condition is that the total number of matching feature points of the result fusion frame is not less than the first preset threshold, and the one-out-of-three condition of the first preset condition is one of the following: the number of result fusion frames between the result fusion frame and the previous result fusion frame satisfying the first preset condition exceeds the first preset frame number; or the number of result fusion frames between the result fusion frame and the previous result fusion frame satisfying the first preset condition exceeds the second preset frame number and the number of result fusion frames satisfying the second preset condition in the key frame queue is 0; or the number of result fusion frames satisfying the second preset condition in the key frame queue does not exceed the second preset threshold. The first preset threshold and the second preset threshold are both set by the user according to the user's own requirements, which is not limited by the present invention; the second preset threshold is preferably 3. The first preset frame number is the maximum number of fusion frames allowed between adjacent result fusion frames satisfying the first preset condition, and the second preset frame number is the minimum number; both may be set by the user according to actual requirements. The second preset condition may refer to fused key frames queued in the key frame queue that have not yet been used for composition.
Specifically, when the number of result fusion frames between the result fusion frame and the previous result fusion frame satisfying the first preset condition exceeds the first preset frame number, the result fusion frame is determined to satisfy the first preset condition; this prevents qualifying result fusion frames from being spaced too far apart, which would lose information. Meanwhile, when the number of result fusion frames between the result fusion frame and the previous result fusion frame satisfying the first preset condition exceeds the second preset frame number and the number of result fusion frames satisfying the second preset condition in the key frame queue is 0, the key frame queue is currently idle and the result fusion frame can also be taken as satisfying the first preset condition; requiring at least the second preset frame number prevents qualifying result fusion frames from being spaced too close together, which would produce heavily overlapping information.
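A minimal sketch of this keyframe decision is given below; all numeric defaults are illustrative assumptions, since the patent leaves the thresholds to the user.

```python
def satisfies_first_condition(matched_total, frames_since_last_keyframe,
                              pending_in_queue, *,
                              first_threshold=100,     # first preset threshold (assumed)
                              first_frame_gap=30,      # first preset frame number (assumed)
                              second_frame_gap=5,      # second preset frame number (assumed)
                              second_threshold=3):     # second preset threshold (preferably 3)
    """Sketch of the first preset condition: a necessary constraint on the
    number of matching feature points plus a one-out-of-three constraint."""
    # Necessary constraint: enough matching feature points in the result fusion frame.
    if matched_total < first_threshold:
        return False
    # One-out-of-three constraint: any single alternative suffices.
    return (frames_since_last_keyframe > first_frame_gap
            or (frames_since_last_keyframe > second_frame_gap and pending_in_queue == 0)
            or pending_in_queue <= second_threshold)
```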
Further, before step S13, the method further includes: determining a reference camera from the N cameras, and taking the coordinate system corresponding to the reference camera as the world coordinate system; acquiring the internal parameters of the N cameras; obtaining the world coordinates and pixel coordinates of all matched feature points in the result fusion frame; obtaining the estimated poses of the result fusion frame and of the result fusion frame at the initial moment by using a uniform motion model assumption; and iteratively optimizing the estimated pose according to the world coordinate system, the internal parameters, the world coordinates and the pixel coordinates through formula I to obtain the result pose of the result fusion frame. Formula I is as follows:
$$
\left\{\hat{R}_{k}^{1},\,\hat{t}_{k}^{1}\right\}=\underset{R_{k}^{1},\,t_{k}^{1}}{\arg\min}\;\sum_{i=1}^{N}\sum_{j}\left\|e_{jk}^{i}\right\|^{2}
$$

$$
e_{jk}^{i}=\begin{bmatrix}u_{jk}^{i}\\ v_{jk}^{i}\end{bmatrix}-h\!\left(\begin{bmatrix}I_{3}&\mathbf{0}\end{bmatrix}\begin{bmatrix}R^{i}&t^{i}\\ \mathbf{0}^{\top}&1\end{bmatrix}T_{k}^{1}\begin{bmatrix}P_{j}\\ 1\end{bmatrix}\right),\qquad
T_{k}^{1}=\begin{bmatrix}R_{k}^{1}&t_{k}^{1}\\ \mathbf{0}^{\top}&1\end{bmatrix},\qquad
h\!\left(\begin{bmatrix}x\\ y\\ z\end{bmatrix}\right)=\begin{bmatrix}f_{x}\,x/z\\ f_{y}\,y/z\end{bmatrix}
$$

wherein $i$ denotes the $i$-th camera of the N cameras, $j$ denotes the $j$-th feature point among the matched feature points included in the result fusion frame, and $k$ denotes the $k$-th moment corresponding to the result fusion frame; $e_{jk}^{i}$ is the reprojection error of the $j$-th feature point at the $k$-th moment in the camera coordinate system corresponding to the $i$-th camera; $h$ is the reprojection function; $P_{j}$ is the world coordinate of the $j$-th feature point; $z_{jk}^{i}$, $u_{jk}^{i}$ and $v_{jk}^{i}$ are respectively the depth coordinate, the abscissa and the ordinate of the $j$-th feature point at the $k$-th moment in the camera coordinate system corresponding to the $i$-th camera; $f_{x}$ and $f_{y}$ are two intrinsic parameters of the $i$-th camera; $R_{k}^{1}$ and $t_{k}^{1}$ are the estimated rotation matrix and the estimated translation vector of camera No. 1 (the reference camera) at the $k$-th moment in the estimated pose; $T_{k}^{1}$ is the matrix obtained by the SE(3) transformation of the estimated rotation matrix and the estimated translation vector, and $I_{3}$ is the third-order identity matrix; $t^{i}$ and $R^{i}$ are the translation vector and the rotation matrix from the world coordinate system to the camera coordinate system corresponding to the $i$-th camera.
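For illustration, the following sketch evaluates the per-camera reprojection residual of formula I and refines the reference-camera pose with a generic least-squares solver; the exact composition of the reference-camera pose with the per-camera extrinsics follows the reconstruction above and is an assumption, as is the use of scipy for the iterative optimization.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation


def reprojection_residual(R1k, t1k, Ri, ti, Pj, uv_obs, fx, fy):
    """e_jk^i of formula I: observed coordinates minus the reprojection of the
    world point P_j through the reference-camera pose (R1k, t1k) and the
    camera-i extrinsics (Ri, ti)."""
    x, y, z = Ri @ (R1k @ Pj + t1k) + ti      # point in camera i's coordinate system
    return uv_obs - np.array([fx * x / z, fy * y / z])


def refine_pose(x0, observations, extrinsics, points, fx, fy):
    """Iteratively optimize the reference-camera pose by minimizing the summed
    squared reprojection errors over all cameras and matched feature points.
    x0 = [rotation vector (3), translation (3)];
    observations[(i, j)] -> observed (u, v); extrinsics[i] -> (Ri, ti)."""
    def residuals(x):
        R1k = Rotation.from_rotvec(x[:3]).as_matrix()
        res = []
        for (i, j), uv in observations.items():
            Ri, ti = extrinsics[i]
            res.extend(reprojection_residual(R1k, x[3:], Ri, ti, points[j], uv, fx, fy))
        return np.asarray(res)

    return least_squares(residuals, x0).x
```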
Accordingly, step S13 includes: and judging whether the result fusion frame with the result pose meets a first preset condition.
The reference camera of the present invention may be the leftmost camera of the N cameras or a camera at another position; the invention is not limited in this respect. The camera coordinate system to which the reference camera belongs at the initial moment is taken as the world coordinate system.
Meanwhile, based on the uniform motion model assumption, the pose change between the result fusion frames at the two moments preceding the result fusion frame is multiplied by the pose of the result fusion frame at the previous moment, so as to obtain the estimated pose of the result fusion frame.
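A minimal sketch of this uniform-motion-model prediction, assuming the result poses are stored as 4x4 homogeneous matrices:

```python
import numpy as np

def predict_pose(T_km2, T_km1):
    """Uniform (constant-velocity) motion model: apply the pose change between
    the two preceding result fusion frames to the most recent pose to obtain
    the estimated pose for the current moment."""
    delta = T_km1 @ np.linalg.inv(T_km2)   # pose change over one step
    return delta @ T_km1                   # estimated pose at moment k
```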
Step S14: and when the fused frame meets the first preset condition, adding the fused frame to a key frame queue.
Further, step S14 includes: and when the result fusion frame meets the first preset condition, adding the result fusion frame to a key frame queue.
It should be noted that the key frame queue is used for storing result fusion frames that satisfy the first preset condition. Because constructing the map of the target area involves a large amount of data processing, a result fusion frame that satisfies the first preset condition cannot always be used for map construction immediately when it is obtained; the qualifying result fusion frames must be used for map construction in order, so a newly obtained result fusion frame satisfying the first preset condition is added to the key frame queue, and the map of the target area is then constructed using the result fusion frames in the key frame queue.
Generally, a certain number of result fusion frames satisfying the first preset condition are waiting in the key frame queue; when no such result fusion frames are pending in the queue, a newly added result fusion frame is used for constructing the map of the target area immediately.
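A rough sketch of this queue behaviour, using a plain FIFO; the build_local_map callback is a hypothetical placeholder for the map-construction step:

```python
from collections import deque

keyframe_queue = deque()     # result fusion frames that satisfy the first preset condition

def enqueue_result_fusion_frame(frame):
    """Tracking side: qualifying result fusion frames wait here in arrival order."""
    keyframe_queue.append(frame)

def mapping_step(build_local_map):
    """Mapping side: frames are taken out in order; when the queue is empty,
    a newly enqueued frame is processed as soon as it arrives."""
    while keyframe_queue:
        build_local_map(keyframe_queue.popleft())
```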
Step S15: and constructing a map of the target area by using the key frame queue.
It should be noted that, a local key frame set is obtained according to the result fusion frame in the key frame queue; obtaining a local matching feature point set through the local key frame set; updating the result pose of the result fusion frame and the world coordinates of the matching feature points included in the result fusion frame through a formula II according to the local key frame set and the local matching feature point set to obtain an updated result pose and updated world coordinates; obtaining a selected fused frame based on the result fused frame with the updated result pose and the updated world coordinates; constructing a map of the target area by using the selected fusion frame;
the second formula is:
$$
\min_{\left\{T_{k}^{1}\right\}_{k\in B_{L}},\;\left\{P_{j}\right\}}\;\sum_{k\in B_{L}\cup B_{F}}\;\sum_{j\in\chi_{k}}\;\sum_{i=1}^{N}\rho\!\left(\left(e_{jk}^{i}\right)^{\!\top}Q_{ij}^{-1}\,e_{jk}^{i}\right)
$$

where $\rho$ is the kernel function, $Q_{ij}$ is the covariance matrix of the $j$-th point in the camera coordinate system corresponding to the $i$-th camera, $B_{L}$ is the local key frame set, $B_{F}$ is the set of result fusion frames that include the matching feature points but are not in $B_{L}$, $N$ is the total number of cameras, $\chi_{k}$ is the set consisting of all the matching feature points included in the result fusion frame at the $k$-th moment, and $e_{jk}^{i}$ is the reprojection error defined in formula I.
Result fusion frames are taken out of the key frame queue in order (each of them satisfies the first preset condition). The currently taken-out result fusion frame is associated with all previously taken-out result fusion frames, and also with the matching feature points it contains. Among all the matching feature points included in the taken-out result fusion frames, those observed fewer than 3 times in total are removed; duplicate matching feature points between the currently taken-out result fusion frame and the adjacent previously taken-out result fusion frames are then removed. The currently taken-out result fusion frame together with its adjacent taken-out result fusion frames forms the local key frame set, and all the matching feature points in this set form the local matching feature point set.
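A minimal sketch of forming the local key frame set and local matching feature point set; the frame representation (a set of point ids per frame) and the window of adjacent frames are assumptions:

```python
from collections import Counter

def build_local_sets(taken_frames, window=3, min_observations=3):
    """Form the local key frame set and local matching feature point set from
    the result fusion frames taken out of the key frame queue so far.
    Each frame is assumed to expose `frame.points`, a set of matching-feature-point ids."""
    counts = Counter(p for f in taken_frames for p in f.points)
    local_keyframes = taken_frames[-window:]        # current frame plus adjacent ones
    local_points = {p for f in local_keyframes for p in f.points
                    if counts[p] >= min_observations}   # drop rarely observed points
    return local_keyframes, local_points
```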
The data in the result fusion frame are then updated with the obtained updated result pose and updated world coordinates, i.e. the previous result pose and world coordinates are replaced by the updated ones, giving a selected fusion frame. Matching feature points of the selected fusion frame that also exist in its co-view selected fusion frames are culled to obtain the composition points, and the culling proportion is preferably 90%. A co-view selected fusion frame is a selected fusion frame, among all selected fusion frames obtained up to the current moment, that shares more than 15 common map points with the selected fusion frame; common map points are matching feature points having the same world coordinates.
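A sketch of this co-view culling step is given below; the 90% cull ratio and the threshold of 15 common map points follow the preferred values above, while the id-based point representation is an assumption:

```python
def select_composition_points(selected_frame, all_selected_frames,
                              common_threshold=15, cull_ratio=0.9):
    """Cull matching feature points that the selected fusion frame shares with its
    co-view selected fusion frames (frames sharing more than `common_threshold`
    common map points); the remaining points are kept as composition points.
    Points are assumed to be ids, identical ids meaning identical world coordinates."""
    coview = [f for f in all_selected_frames
              if f is not selected_frame
              and len(f.points & selected_frame.points) > common_threshold]
    shared = set()
    for f in coview:
        shared |= (f.points & selected_frame.points)
    n_cull = int(len(shared) * cull_ratio)          # cull about 90% of the shared points
    culled = set(sorted(shared)[:n_cull])
    return selected_frame.points - culled
```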
Finally, the map of the target area is constructed using the composition points. It can be understood that the composition points screened out of each selected fusion frame are the matching feature points corresponding to new target points captured by the SLAM system (a target point being the object point in the target area to which a feature point corresponds), and the map of the target area is constructed from these matching feature points.
In specific application, as time goes on, N target images corresponding to each moment are used for obtaining a fusion frame, and a map of a target area is constructed by using composition points corresponding to all the fusion frames in the whole running time.
The technical scheme of this embodiment adopts a map construction method applied to a SLAM system including N cameras: target images of the target area are acquired by the N cameras; a fusion frame is obtained from the target images, the target images of the N cameras at the same moment corresponding to one fusion frame; whether the fusion frame satisfies the first preset condition is judged; when the fusion frame satisfies the first preset condition, it is added to the key frame queue; and the map of the target area is constructed using the key frame queue. Because only the fusion frames satisfying the preset condition are added to the key frame queue and the map of the target area is constructed from the key frame queue rather than from all fused key frames, the amount of data processing is reduced and the map construction efficiency is improved.
Meanwhile, the information involved in formula I and formula II is optimized, and the result map of the target area is obtained using the optimized result fusion frames, so the accuracy of the result map is high.
Further, after step S15, the method further includes: screening candidate key frames from all historical selected fusion frames at moments before the selected fusion frame; performing a SIM3 transformation on the selected fusion frame and the candidate key frames to obtain a similarity transformation matrix; adjusting the selected fusion frame and the historical selected fusion frames with the similarity transformation matrix to obtain an adjusted selected fusion frame and adjusted historical selected fusion frames; correcting the result pose of the adjusted selected fusion frame, the world coordinates of the matching feature points included in the adjusted selected fusion frame, the result poses of the adjusted historical selected fusion frames and the world coordinates of the matching feature points included in the adjusted historical selected fusion frames, so as to obtain a corrected fusion frame pose, corrected fusion frame world coordinates, corrected historical fusion frame poses and corrected historical fusion frame world coordinates; and correcting the map of the target area using the corrected fusion frame pose, the corrected fusion frame world coordinates, the corrected historical fusion frame poses and the corrected historical fusion frame world coordinates, so as to obtain the result map of the target area.
It should be noted that, each time the map of the target area is constructed using a selected fusion frame having the updated result pose and the updated world coordinates, the selected fusion frame needs to undergo the above correction processing, after which the result map is obtained.
In specific application, during correction, the result pose of the adjusted selected fusion frame, the world coordinates of the matching feature points included in the adjusted selected fusion frame, the result poses of the adjusted historical selected fusion frames and the world coordinates of the matching feature points included in the adjusted historical selected fusion frames are corrected using the similarity transformation matrix, so as to obtain the corrected fusion frame pose, the corrected fusion frame world coordinates, the corrected historical fusion frame poses and the corrected historical fusion frame world coordinates; each pose includes a rotation matrix and a translation vector. This is not described in detail here.
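As a rough sketch only, one plausible way of propagating the similarity transformation onto the affected poses and world coordinates is shown below; whether the scale factor is subsequently removed from the corrected poses is an implementation detail not fixed by this sketch:

```python
import numpy as np

def apply_sim3_correction(S, poses, points):
    """Propagate a 4x4 similarity transformation S (rotation, translation and a
    scale factor) onto the affected frame poses and map-point world coordinates.
    Poses are 4x4 homogeneous matrices; points are 3-vectors."""
    corrected_poses = {k: S @ T for k, T in poses.items()}   # scale may need re-normalising
    corrected_points = {}
    for j, P in points.items():
        Ph = S @ np.append(P, 1.0)
        corrected_points[j] = Ph[:3] / Ph[3]
    return corrected_poses, corrected_points
```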
Further, the step of screening candidate key frames from all historical selected fusion frames at a time before the selected fusion frame comprises: calculating bag-of-word scores of the selected fusion frame and all historical selected fusion frames corresponding to the selected fusion frame at the previous moment; screening the primary selection fusion frames meeting a third preset condition from all the historical selection fusion frames; screening selected key frames with the word bag score exceeding a preset word bag score threshold value in the primary selection fusion frames; calculating the maximum value of the number of common words of the selected key frame and the selected fusion frame; and screening candidate key frames with the common word number larger than a preset word number threshold value from the selected key frames.
Wherein the third preset condition is realized as follows:
The selected fusion frames connected to the selected fusion frame having the updated result pose and the updated world coordinates form a "sub-candidate group", and the selected fusion frame itself is then added to the "sub-candidate group". Each "sub-candidate group" is traversed to detect whether any of its selected fusion frames also exists in a previous "continuous group"; if some frame exists both in the "sub-candidate group" and in the previous "continuous group", the "sub-candidate group" is consecutive with that "continuous group", the current consecutive-count variable is increased by 1, and the "sub-candidate group" is put into the current "continuous group". If the value of the current consecutive-count variable is greater than or equal to 3, the selected fusion frame corresponding to the "sub-candidate group" satisfies the third preset condition.
In addition, the preset bag-of-words score threshold may be the lowest of the bag-of-words scores, or another threshold set by the user according to the user's own requirements; the preset word number threshold is the maximum common word number multiplied by a preset coefficient, the preset coefficient preferably being 0.8, and the user may also set the preset word number threshold according to the user's own requirements, which is not limited herein.
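Putting the screening steps together, a minimal sketch could look as follows; the bag-of-words scoring function, the third-condition predicate and the word-set attribute of each frame are hypothetical stand-ins:

```python
def screen_candidate_keyframes(selected_frame, history_frames, bow_score,
                               passes_third_condition, word_coeff=0.8):
    """Bag-of-words screening of candidate key frames among the historical
    selected fusion frames."""
    scores = {f: bow_score(selected_frame, f) for f in history_frames}
    if not scores:
        return []
    min_score = min(scores.values())                # lowest bag-of-words score
    # Initially selected frames: those satisfying the third preset condition.
    initial = [f for f in history_frames if passes_third_condition(f)]
    # Selected key frames: bag-of-words score above the preset score threshold.
    selected_kfs = [f for f in initial if scores[f] > min_score]
    if not selected_kfs:
        return []
    common = {f: len(f.words & selected_frame.words) for f in selected_kfs}
    max_common = max(common.values())               # maximum common-word count
    threshold = max_common * word_coeff             # preset coefficient, preferably 0.8
    return [f for f in selected_kfs if common[f] > threshold]
```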
Referring to fig. 6, fig. 6 is a block diagram showing a first embodiment of the map building apparatus according to the present invention; applied to a SLAM system comprising N cameras, the apparatus comprising:
an acquisition module 10, configured to acquire a target image of a target area through the N cameras;
an obtaining module 20, configured to obtain a fusion frame according to the target image, where the target images of the N cameras at the same moment correspond to one fusion frame;
a judging module 30, configured to judge whether the fusion frame satisfies a first preset condition;
an adding module 40, configured to add the fusion frame to a key frame queue when the fusion frame satisfies the first preset condition;
and a building module 50, configured to build a map of the target area by using the key frame queue.
Referring to FIG. 7, FIG. 7 is a graph comparing the accuracy of the method of the present invention with other map construction methods; the data in the figure include the accuracy of the result map of the target area obtained by the multi-camera SLAM system using the method of the present invention, and the accuracy of the result maps of the target area constructed with the open-source frameworks ORB-SLAM2 and VINS-Stereo.
The above description is only an alternative embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications and equivalents of the present invention, which are made by the contents of the present specification and the accompanying drawings, or directly/indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A map construction method applied to a SLAM system including N cameras, the method comprising the steps of:
acquiring a target image of a target area through the N cameras, wherein N is a positive integer greater than or equal to 2;
acquiring a fusion frame according to the target image, wherein the target images of the N cameras at the same moment correspond to one fusion frame;
judging whether the fusion frame meets a first preset condition or not;
when the fused frame meets the first preset condition, adding the fused frame to a key frame queue;
and constructing a map of the target area by using the key frame queue.
2. The map construction method according to claim 1, wherein the fused frame includes matching feature points that satisfy a preset matching condition and monocular feature points that do not satisfy the preset matching condition; before the step of judging whether the fused frame meets the first preset condition, the method further includes:
determining the monocular feature points in the fusion frame, which satisfy the preset matching condition with the matching feature points in the fusion frame at the previous moment, as new matching feature points;
obtaining a result fusion frame according to the newly added matching feature points and the fusion frame;
the step of judging whether the fusion frame meets a first preset condition comprises the following steps:
judging whether the result fusion frame meets a first preset condition or not;
the step of adding the fused frame to a key frame queue when the fused frame meets the first preset condition comprises:
and when the result fusion frame meets the first preset condition, adding the result fusion frame to a key frame queue.
3. The map construction method according to claim 2, wherein the first preset condition includes a necessary constraint and a one-out-of-three constraint; wherein,
the necessary constraint is that the total number of the matched feature points of the result fusion frame is not less than a first preset threshold;
the one-out-of-three constraint includes that the number of result fused frames between the result fused frame and a previous result fused frame satisfying the first preset condition exceeds a first preset number of frames, or,
the number of result fusion frames between the result fusion frame and the previous result fusion frame satisfying the first preset condition exceeds a second preset frame number and the number of result fusion frames satisfying the second preset condition in the key frame queue is 0, or,
the number of result fusion frames meeting the second preset condition in the key frame queue does not exceed a second preset threshold;
the first preset frame number is greater than the second preset frame number.
4. The map construction method according to claim 3, wherein before the step of judging whether the result fusion frame meets a first preset condition, the method further comprises:
determining a reference camera from the N cameras, and determining a coordinate system corresponding to the reference camera as a world coordinate system;
acquiring internal parameters of the N cameras;
obtaining world coordinates and pixel coordinates of all matched feature points in the result fusion frame;
obtaining the estimated poses of the result fusion frame and the initial moment result fusion frame by using a uniform motion model hypothesis method;
performing iterative optimization on the estimated pose according to the world coordinate system, the internal parameters, the world coordinates and the pixel coordinates through a formula I to obtain a result pose of the result fusion frame;
the step of judging whether the result fusion frame meets a first preset condition comprises the following steps:
judging whether the result fusion frame with the result pose meets a first preset condition or not;
the first formula is as follows:
$$
\left\{\hat{R}_{k}^{1},\,\hat{t}_{k}^{1}\right\}=\underset{R_{k}^{1},\,t_{k}^{1}}{\arg\min}\;\sum_{i=1}^{N}\sum_{j}\left\|e_{jk}^{i}\right\|^{2}
$$

$$
e_{jk}^{i}=\begin{bmatrix}u_{jk}^{i}\\ v_{jk}^{i}\end{bmatrix}-h\!\left(\begin{bmatrix}I_{3}&\mathbf{0}\end{bmatrix}\begin{bmatrix}R^{i}&t^{i}\\ \mathbf{0}^{\top}&1\end{bmatrix}T_{k}^{1}\begin{bmatrix}P_{j}\\ 1\end{bmatrix}\right),\qquad
T_{k}^{1}=\begin{bmatrix}R_{k}^{1}&t_{k}^{1}\\ \mathbf{0}^{\top}&1\end{bmatrix},\qquad
h\!\left(\begin{bmatrix}x\\ y\\ z\end{bmatrix}\right)=\begin{bmatrix}f_{x}\,x/z\\ f_{y}\,y/z\end{bmatrix}
$$

wherein $i$ is the $i$-th camera of the N cameras, $j$ is the $j$-th feature point of all the matched feature points included in the result fusion frame, and $k$ is the $k$-th moment corresponding to the result fusion frame; $e_{jk}^{i}$ is the reprojection error of the $j$-th feature point at the $k$-th moment in the camera coordinate system corresponding to the $i$-th camera; $h$ is the reprojection function; $P_{j}$ is the world coordinate of the $j$-th feature point; $z_{jk}^{i}$, $u_{jk}^{i}$ and $v_{jk}^{i}$ are respectively the depth coordinate, the abscissa and the ordinate of the $j$-th feature point at the $k$-th moment in the camera coordinate system corresponding to the $i$-th camera; $f_{x}$ and $f_{y}$ are two intrinsic parameters of the $i$-th camera; $R_{k}^{1}$ and $t_{k}^{1}$ are the estimated rotation matrix and the estimated translation vector of the reference camera at the $k$-th moment in the estimated pose; $T_{k}^{1}$ is the matrix obtained by the SE(3) transformation of the estimated rotation matrix and the estimated translation vector, and $I_{3}$ is the third-order identity matrix; $t^{i}$ and $R^{i}$ are the translation vector and the rotation matrix from the world coordinate system to the camera coordinate system corresponding to the $i$-th camera.
5. The mapping method of claim 4, wherein the step of constructing a map of the target area using the keyframe queue comprises:
obtaining a local key frame set according to the result fusion frame in the key frame queue;
obtaining a local matching feature point set through the local key frame set;
updating the result pose of the result fusion frame and the world coordinates of the matching feature points included in the result fusion frame through a formula II according to the local key frame set and the local matching feature point set to obtain an updated result pose and updated world coordinates;
obtaining a selected fused frame based on the result fused frame with the updated result pose and the updated world coordinates;
constructing a map of the target area by using the selected fusion frame;
the second formula is:
$$
\min_{\left\{T_{k}^{1}\right\}_{k\in B_{L}},\;\left\{P_{j}\right\}}\;\sum_{k\in B_{L}\cup B_{F}}\;\sum_{j\in\chi_{k}}\;\sum_{i=1}^{N}\rho\!\left(\left(e_{jk}^{i}\right)^{\!\top}Q_{ij}^{-1}\,e_{jk}^{i}\right)
$$

where $\rho$ is the kernel function, $Q_{ij}$ is the covariance matrix of the $j$-th point in the camera coordinate system corresponding to the $i$-th camera, $B_{L}$ is the local key frame set, $B_{F}$ is the set of result fusion frames that include the matching feature points but are not in $B_{L}$, $N$ is the total number of cameras, $\chi_{k}$ is the set consisting of all the matching feature points included in the result fusion frame at the $k$-th moment, and $e_{jk}^{i}$ is the reprojection error defined in formula I.
6. The method of map construction according to claim 5, wherein after said step of constructing a map of a target area using said selected fused frame, said method further comprises:
screening candidate key frames from all historical selected fusion frames at the moment before the selected fusion frame;
SIM3 transformation is carried out on the selected fusion frame and the candidate key frame to obtain a similarity transformation matrix;
utilizing the similarity transformation matrix to adjust the selected fusion frame and the historical selected fusion frame to obtain an adjusted selected fusion frame and an adjusted historical selected fusion frame;
correcting the result pose of the adjusted selected fusion frame, the world coordinates of the matched feature points included in the adjusted selected fusion frame, the result pose of the adjusted historical selected fusion frame and the world coordinates of the matched feature points included in the adjusted historical selected fusion frame to obtain a corrected fusion frame pose, a corrected fusion frame world coordinate, a corrected history fusion frame pose and a corrected history fusion frame world coordinate;
and correcting the map of the target area by using the corrected fusion frame pose, the corrected fusion frame world coordinate, the corrected historical fusion frame pose and the corrected historical fusion frame world coordinate to obtain a result map of the target area.
7. The mapping method of claim 6, wherein the step of screening candidate keyframes from all historical selected fused frames at a time prior to the selected fused frame comprises:
calculating bag-of-word scores of the selected fusion frame and all historical selected fusion frames corresponding to the selected fusion frame at the previous moment;
screening the primary selection fusion frames meeting a third preset condition from all the historical selection fusion frames;
screening selected key frames with the word bag score exceeding a preset word bag score threshold value in the primary selection fusion frames;
calculating the maximum value of the number of common words of the selected key frame and the selected fusion frame;
and screening candidate key frames with the common word number larger than a preset word number threshold value from the selected key frames.
8. A map construction apparatus applied to a SLAM system including N cameras, the apparatus comprising:
the acquisition module is used for acquiring a target image of a target area through the N cameras, wherein N is a positive integer greater than or equal to 2;
an obtaining module, configured to obtain a fusion frame according to the target image, where the target images of the N cameras at the same time correspond to one fusion frame;
the judging module is used for judging whether the fusion frame meets a first preset condition or not;
the adding module is used for adding the fusion frame to a key frame queue when the fusion frame meets the first preset condition;
and the construction module is used for constructing a map of the target area by utilizing the key frame queue.
9. A SLAM system, comprising: n cameras, a memory, a processor and a map building program stored on the memory and running on the processor, the map building program when executed by the processor implementing the steps of the map building method of any one of claims 1 to 7; wherein N is a positive integer greater than or equal to 2.
10. A storage medium, characterized in that it has stored thereon a mapping program which, when executed by a processor, implements the steps of the mapping method according to any one of claims 1 to 7.
CN202011368165.6A 2020-11-27 2020-11-27 Map construction method, map construction device, SLAM system, and storage medium Pending CN112446845A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011368165.6A CN112446845A (en) 2020-11-27 2020-11-27 Map construction method, map construction device, SLAM system, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011368165.6A CN112446845A (en) 2020-11-27 2020-11-27 Map construction method, map construction device, SLAM system, and storage medium

Publications (1)

Publication Number Publication Date
CN112446845A true CN112446845A (en) 2021-03-05

Family

ID=74738798

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011368165.6A Pending CN112446845A (en) 2020-11-27 2020-11-27 Map construction method, map construction device, SLAM system, and storage medium

Country Status (1)

Country Link
CN (1) CN112446845A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112927308A (en) * 2021-03-26 2021-06-08 鹏城实验室 Three-dimensional registration method, device, terminal and computer readable storage medium
CN112927308B (en) * 2021-03-26 2023-09-26 鹏城实验室 Three-dimensional registration method, device, terminal and computer readable storage medium
CN116630442A (en) * 2023-07-19 2023-08-22 绘见科技(深圳)有限公司 Visual SLAM pose estimation precision evaluation method and device
CN116630442B (en) * 2023-07-19 2023-09-22 绘见科技(深圳)有限公司 Visual SLAM pose estimation precision evaluation method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination