CN112435278B - Visual SLAM method and device based on dynamic target detection

Visual SLAM method and device based on dynamic target detection

Info

Publication number
CN112435278B
CN112435278B (application number CN202110100524.8A)
Authority
CN
China
Prior art keywords
feature point
dynamic
static
image
frame image
Prior art date
Legal status
Active
Application number
CN202110100524.8A
Other languages
Chinese (zh)
Other versions
CN112435278A (en)
Inventor
徐雪松 (Xu Xuesong)
曾昱 (Zeng Yu)
Current Assignee
East China Jiaotong University
Original Assignee
East China Jiaotong University
Priority date
Filing date
Publication date
Application filed by East China Jiaotong University
Priority to CN202110100524.8A
Publication of CN112435278A
Application granted
Publication of CN112435278B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/70Denoising; Smoothing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a visual SLAM method based on dynamic target detection. The method temporarily removes potential dynamic regions from the images using the Yolov3 target detection network, optimizes a homography matrix through the reprojection error to solve for a motion compensation frame and obtain a four-frame difference map, then applies filtering, binarization and morphological processing to the four-frame difference map while refining the dynamic target detection result in combination with the Yolov3 network to obtain an improved dynamic target region, and finally performs tracking, mapping and loop closure detection for the visual SLAM using the feature points of the static regions. The method removes potential dynamic regions in the scene with a deep learning target detection network and roughly estimates a homography matrix; it then judges, with a method combining the reprojection error and the inter-class variance, whether the feature points in a potential dynamic region can be used to calculate the homography matrix, so as to optimize the homography matrix and improve its accuracy.

Description

Visual SLAM method and device based on dynamic target detection
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a visual SLAM method and device based on dynamic target detection.
Background
Simultaneous localization and mapping (SLAM) technology is being applied ever more widely in fields such as robot localization and autonomous driving. Visual sensors are easy to carry and inexpensive, so they are widely used in SLAM. Most traditional visual SLAM algorithms, such as ORB-SLAM2, DSO and SVO, assume that the camera operates in a static environment; when the scene contains dynamic regions, the feature points that the visual SLAM extracts on dynamic objects degrade the accuracy of the algorithm.
For the problem of accuracy degradation of visual odometry in dynamic scenes, a commonly adopted method is to detect the dynamic objects in the image in advance, remove the feature points of the dynamic regions, and retain the feature points of the static regions for tracking and mapping of the visual SLAM. However, in images where the dynamic regions are large, removing them severely degrades the accuracy of tracking and mapping of the visual SLAM.
The defects of the prior art mainly stem from the following: a deep learning target detection network used alone can classify objects with mobility, such as people and automobiles, as potential dynamic targets in advance, but it cannot judge whether a potential dynamic target is actually in motion; if the potential dynamic target is in fact static, too many static feature points are removed. Algorithms that combine depth information for dynamic detection exist, but when the depth information of some image regions is uncertain, or when the foreground and background depths are close, their classification may be inaccurate.
Disclosure of Invention
The present invention provides a visual SLAM method and apparatus based on dynamic target detection, which is used to solve at least one of the above technical problems.
The invention provides a visual SLAM method based on dynamic target detection, which comprises the following steps: in response to each acquired image frame, performing region segmentation on the image frame based on a deep learning target detection network, wherein each image frame comprises a potential dynamic region and/or a static region, the potential dynamic region comprises motion feature points and/or first static feature points, and the static region comprises second static feature points; matching the second static feature points of the previous frame image with the second static feature points of the current frame image; in response to the acquired matching relation, calculating a first homography matrix based on the RANSAC (Random Sample Consensus) algorithm; respectively extracting the first static feature points of the previous frame image and of the current frame image based on a motion feature point filtering method, wherein the motion feature point filtering method combines the reprojection error of the feature points with the maximum inter-class variance method; optimizing the first homography matrix to obtain a second homography matrix based on the matching relation between the first static feature points of the previous frame image and those of the current frame image; and performing motion compensation on the previous frame image according to the second homography matrix, so as to obtain a motion compensation frame image.
The invention provides a visual SLAM device based on dynamic target detection, which comprises: the segmentation module is configured to perform region segmentation on each image frame based on a deep learning target detection network in response to the acquired image frame, wherein each image frame comprises a potential dynamic region and/or a static region, the potential dynamic region comprises a motion feature point and/or a first static feature point, and the static region comprises a second static feature point; the matching module is configured to match the second static feature point of the previous frame image with the second static feature point of the current frame image; the calculation module is configured to respond to the acquired matching relationship and calculate to obtain a first homography matrix based on a RANSAC algorithm; the extraction module is configured to respectively extract the first static feature point of the previous frame image and the first static feature point of the current frame image based on a motion feature point filtering method, wherein the motion feature point filtering method is a method formed by combining a reprojection error of feature points with a maximum inter-class variance method; the optimization module is configured to optimize the first homography matrix and obtain a second homography matrix based on the matching relation between the first static feature point of the previous frame image and the first static feature point of the current frame image; and the compensation module is configured to perform motion compensation on the previous frame image according to the second homography matrix, so that a motion compensation frame image is obtained.
An electronic device is provided, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the visual SLAM method based on dynamic object detection of the present invention.
The present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the dynamic object detection based visual SLAM method of the present invention.
The method and the device adopt a deep learning target detection network to eliminate potential dynamic regions in a scene, roughly estimate a homography matrix, judge whether feature points on the potential dynamic regions can be used for calculation of the homography matrix or not based on a method of combining a reprojection error and an inter-class variance, and optimize the homography matrix, so that the precision of the homography matrix is effectively improved, the result of motion compensation is further optimized, and a dynamic target in an image can be accurately obtained through a frame difference method.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a flowchart of a visual SLAM method based on dynamic target detection according to an embodiment of the present invention;
Fig. 2 is a flowchart of a visual SLAM method based on dynamic target detection according to another embodiment of the present invention;
Fig. 3 is a flowchart of a visual SLAM method based on dynamic target detection according to yet another embodiment of the present invention;
Fig. 4 is a diagram illustrating the effect of detecting a motion region when an image is blurred according to an embodiment of the present invention;
Fig. 5 is a block diagram of a visual SLAM device based on dynamic target detection according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, a flowchart of an embodiment of a visual SLAM method based on dynamic object detection according to the present application is shown.
As shown in fig. 1, the visual SLAM method based on dynamic object detection includes the following steps:
in S101, responding to each acquired image frame, performing region segmentation on each image frame based on a deep learning target detection network, wherein each image frame comprises a potential dynamic region and/or a static region, the potential dynamic region comprises a motion feature point and/or a first static feature point, and the static region comprises a second static feature point;
in this embodiment, in response to each acquired image frame, each image frame is subjected to region segmentation based on a deep learning target detection network, the deep learning target detection network adopts a Darknet53 network and a multi-scale feature to perform target detection, and has good identification speed and accuracy, common objects with motility, such as pedestrians and vehicles, can be effectively identified, such objects with motility are classified as potential dynamic objects, the region where the potential dynamic objects are located is a potential dynamic region, the potential dynamic region contains a movement feature point and/or a first static feature point, the region where the static objects are located is a static region, and the static region contains a second static feature point.
According to the scheme of this embodiment, a deep learning target detection network is adopted for target detection: dynamic target detection is performed on each image frame, and the potential dynamic regions and/or static regions in each image frame are screened and segmented. Each image frame may contain a potential dynamic region, and a potential dynamic region may contain first static feature points; this allows the subsequent visual SLAM device to temporarily remove the potential dynamic regions for feature point matching and to roughly calculate a homography matrix with the RANSAC algorithm.
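As an illustration only (the patent specifies the Yolov3 network but no implementation), the following sketch uses OpenCV's DNN module with standard Darknet config/weight files to obtain potential dynamic boxes and split keypoints accordingly; the file names and COCO class ids are assumptions, not values from the patent:

```python
import cv2
import numpy as np

POTENTIAL_DYNAMIC_CLASSES = {0, 2}  # assumed COCO ids for "person" and "car"

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")

def detect_potential_dynamic_boxes(frame, conf_thresh=0.5):
    """Return [x, y, w, h] boxes of potentially dynamic objects in the frame."""
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(net.getUnconnectedOutLayersNames())
    h, w = frame.shape[:2]
    boxes = []
    for out in outputs:
        for det in out:
            scores = det[5:]
            cls = int(np.argmax(scores))
            if cls in POTENTIAL_DYNAMIC_CLASSES and scores[cls] > conf_thresh:
                cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
                boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
    return boxes

def split_keypoints(keypoints, boxes):
    """Separate keypoints inside potential dynamic boxes from static-region ones."""
    def inside(pt):
        return any(x <= pt[0] <= x + w and y <= pt[1] <= y + h for x, y, w, h in boxes)
    potential = [kp for kp in keypoints if inside(kp.pt)]
    static = [kp for kp in keypoints if not inside(kp.pt)]
    return potential, static
```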
In S102, the second static feature point of the previous frame image and the second static feature point of the current frame image are matched.
In this embodiment, the second static feature points of the previous frame image and of the current frame image are matched, so as to obtain the matching relationship between them.
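A minimal sketch of this matching step, assuming ORB features and brute-force Hamming matching (the patent does not name a specific descriptor or matcher); the masks are assumed to be 255 in the static region and 0 inside potential dynamic regions:

```python
import cv2

orb = cv2.ORB_create(nfeatures=1000)

def match_static_points(prev_gray, curr_gray, prev_mask=None, curr_mask=None):
    """Detect and match second static feature points between two frames."""
    kp1, des1 = orb.detectAndCompute(prev_gray, prev_mask)
    kp2, des2 = orb.detectAndCompute(curr_gray, curr_mask)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    return kp1, kp2, matches
```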
In S103, in response to the obtained matching relationship, a first homography matrix is calculated based on the RANSAC algorithm.
In this embodiment, in response to the obtained matching relationship, the first homography matrix is calculated based on the RANSAC algorithm. Specifically, in scenes where the dynamic area is small, the calculated first homography matrix can be used directly to motion-compensate the image.
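This step maps naturally onto cv2.findHomography with the RANSAC flag; a minimal sketch, where the 3.0-pixel reprojection threshold is an illustrative assumption:

```python
import cv2
import numpy as np

def first_homography(kp1, kp2, matches, ransac_thresh=3.0):
    """Estimate the first homography H1 from matched static-region points."""
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H1, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, ransac_thresh)
    return H1, inlier_mask
```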
In S104, a first static feature point of the previous frame image and a first static feature point of the current frame image are respectively extracted based on a motion feature point filtering method, where the motion feature point filtering method combines the reprojection error of the feature points with the maximum inter-class variance method.
In this embodiment, to determine whether the feature points of the potential dynamic region can be used to calculate the homography matrix H, and thus to optimize the first homography matrix, a method combining the feature point reprojection error with the maximum inter-class variance is adopted. The specific steps are as follows:
Suppose $p_1$ and $p_2$ are a pair of feature points matched between the previous and current frames; the matched points and the homography matrix $H$ satisfy formula (1). Assuming the two frames have $N$ pairs of matched feature points, there are $N$ reprojection errors, and the reprojection error $e$ of one pair of matched points is computed by formula (2). The $N$ reprojection errors are divided into $k$ levels; the number of feature points at level $i$ ($1 \le i \le k$) is $n_i$, so that $\sum_{i=1}^{k} n_i = N$.

$$p_2 = H p_1 \tag{1}$$

$$e = \lVert p_2 - H p_1 \rVert \tag{2}$$

Let the average of the $N$ reprojection errors be $\mu$. Denote the set of first static feature points and second static feature points by $C_0$ and the set of dynamic feature points by $C_1$. For a candidate threshold level $t$, the proportion $\omega_0$ of $C_0$ and the proportion $\omega_1$ of $C_1$ are given by formula (3), and the mean $\mu_0$ of the static feature point set and the mean $\mu_1$ of the dynamic feature point set are given by formula (4).

$$\omega_0 = \frac{1}{N}\sum_{i=1}^{t} n_i, \qquad \omega_1 = \frac{1}{N}\sum_{i=t+1}^{k} n_i \tag{3}$$

$$\mu_0 = \frac{1}{N\omega_0}\sum_{i=1}^{t} i\, n_i, \qquad \mu_1 = \frac{1}{N\omega_1}\sum_{i=t+1}^{k} i\, n_i \tag{4}$$

Thus, the inter-class variance $\sigma^2$ can be estimated as shown in formula (5):

$$\sigma^2 = \omega_0 (\mu_0 - \mu)^2 + \omega_1 (\mu_1 - \mu)^2 \tag{5}$$

According to formula (6), formula (5) can be simplified to formula (7).

$$\mu = \omega_0 \mu_0 + \omega_1 \mu_1 \tag{6}$$

$$\sigma^2 = \omega_0 \omega_1 (\mu_0 - \mu_1)^2 \tag{7}$$

Traversing $t$ between 0 and $k$, the residual distance that maximizes the variance $\sigma^2$ is recorded as $e^{*}$. If the reprojection error of a pair of matched points satisfies $e > e^{*}$, the feature point is a dynamic feature point; if $e \le e^{*}$, it is a first static feature point or a second static feature point.
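A minimal sketch of this motion feature point filtering, assuming the errors are quantized into k = 64 uniform levels (the patent does not state k); it evaluates formulas (3)-(7) numerically and returns the residual threshold $e^{*}$ that maximizes the inter-class variance:

```python
import cv2
import numpy as np

def reprojection_errors(H, pts_prev, pts_curr):
    """||p2 - H p1|| for each matched pair; pts_* are Nx2 coordinate arrays."""
    src = np.float32(pts_prev).reshape(-1, 1, 2)
    proj = cv2.perspectiveTransform(src, H).reshape(-1, 2)
    return np.linalg.norm(np.float32(pts_curr) - proj, axis=1)

def otsu_error_threshold(errors, k=64):
    """Residual threshold e* maximizing the inter-class variance over k levels."""
    e_max = errors.max() + 1e-9
    levels = np.minimum((errors / e_max * k).astype(int), k - 1)
    p = np.bincount(levels, minlength=k).astype(float) / len(errors)
    idx = np.arange(k)
    best_sigma, best_t = -1.0, 0
    for t in range(k):                      # traverse candidate threshold levels
        w0, w1 = p[:t + 1].sum(), p[t + 1:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (idx[:t + 1] * p[:t + 1]).sum() / w0
        mu1 = (idx[t + 1:] * p[t + 1:]).sum() / w1
        sigma = w0 * w1 * (mu0 - mu1) ** 2  # formula (7)
        if sigma > best_sigma:
            best_sigma, best_t = sigma, t
    return (best_t + 1) / k * e_max         # map the level back to the error scale

# feature points with error > e* are treated as dynamic; the rest as static
```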
In S105, the first homography matrix is optimized and a second homography matrix is obtained based on the matching relationship between the first stationary feature point of the previous frame image and the first stationary feature point of the current frame image.
In this embodiment, the first homography matrix is optimized and the second homography matrix is obtained based on the matching relationship between the first static feature point of the previous frame image and the first static feature point of the current frame image.
In S106, motion compensation is performed on the previous frame image according to the optimized second homography matrix, so that a motion compensated frame image is obtained.
In the scheme of this embodiment, the matching relationship between the first static feature points of the previous frame image and of the current frame image is used to optimize the first homography matrix into the second homography matrix, and motion compensation is performed on the image according to the second homography matrix, which effectively improves the precision of the motion compensation frame image. Specifically, according to the second homography matrix, the expression for performing motion compensation on the previous frame image is:
$$p'_{t-1} = H_{t-1,t}\, p_{t-1}$$

where $p_{t-1}$ is a pixel point of the previous frame, $p'_{t-1}$ is the compensated pixel point of $p_{t-1}$, and $H_{t-1,t}$ is the homography matrix of the previous frame and the current frame;
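In OpenCV terms, applying the second homography to every pixel of the previous frame is a perspective warp; a minimal sketch, assuming H2 maps previous-frame coordinates to current-frame coordinates:

```python
import cv2

def compensate_previous_frame(prev_frame, H2):
    """Warp the previous frame with the second homography to obtain the motion compensation frame."""
    h, w = prev_frame.shape[:2]
    return cv2.warpPerspective(prev_frame, H2, (w, h))
```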
according to the method, the traditional visual SLAM is used in a static environment, when a dynamic object exists in a scene, the accuracy of the visual SLAM is reduced, the dynamic object in the image is mainly detected, and the accuracy of the SLAM is improved. When the camera moves, the current frame image can be subjected to motion compensation and then subjected to a frame difference method to obtain a dynamic region in the picture.
When the translation distance of the camera is small relative to the depth of the scene, the homography matrix H can be used as the motion compensation matrix. Calculating H requires matching the previous and current frames, and if dynamic objects exist in the scene, the estimate of H is inaccurate. A deep learning target detection network is therefore used to eliminate potential dynamic objects in the scene and roughly estimate H. However, the network cannot judge whether a potential dynamic object is actually in motion; if the object is in fact static, the feature points on it can also participate in the calculation of H and improve its precision. Whether the feature points on a potential dynamic object can be used for the calculation of H is therefore judged with a method combining the reprojection error and the inter-class variance, so that the precision of H is improved.
Referring to fig. 2, a flowchart of steps additional to and further defining the flow of fig. 1 is shown, according to yet another embodiment of the visual SLAM method based on dynamic object detection of the present application.
As shown in fig. 2, in S201, the motion-compensated frame image is differenced with the current frame image so as to obtain a frame difference map;
in S202, analyzing the frame difference map subjected to denoising and morphological processing based on a connected region algorithm, so as to determine a dynamic region, where the dynamic region only includes motion feature points;
in S203, the current frame image is subjected to dynamic region elimination, and tracking, image creation, and loop detection of the visual SLAM are performed based on the current frame image from which the dynamic region is eliminated.
In this embodiment, for S201, the motion compensation frame image and the current frame image are differenced so as to obtain a frame difference map, where the difference is computed as:

$$D_t(x, y) = \lvert I_t(x, y) - I'_{t-1}(x, y) \rvert$$

where $I_t(x, y)$ is the pixel value of the $t$-th frame at $(x, y)$, $I'_{t-1}(x, y)$ is the pixel value of the compensated frame $t-1$ at $(x, y)$, and $D_t(x, y)$ is the frame difference of the $t$-th frame at $(x, y)$. Then, for S202, the frame difference map subjected to denoising and morphological processing is analyzed based on the connected region algorithm to determine a dynamic region, where the dynamic region only includes motion feature points. Then, for S203, the dynamic region is removed from the current frame image, and tracking, mapping, and loop closure detection of the visual SLAM are performed based on the current frame image from which the dynamic region has been removed.
Please refer to fig. 3, which shows a flowchart of another embodiment of the visual SLAM method based on dynamic target detection according to the present application; this flowchart mainly further defines S202, i.e., the step of analyzing the denoised and morphologically processed frame difference map based on the connected region algorithm to determine the dynamic region.
As shown in fig. 3, in S301, in response to the acquired frame difference map, the frame difference map is denoised based on filtering and binarization processing so as to obtain a binary map;
in S302, in response to the obtained binary image, setting each pixel value of the static region in the binary image to zero based on the deep learning target detection network;
in S303, the processed binary image is morphologically processed, and a dynamic region is obtained based on a connected region algorithm analysis.
In the present embodiment, for S301, in response to the acquired frame difference map, the frame difference map is denoised based on filtering and binarization processing to obtain a binary map. Thereafter, for S302, in response to the acquired binary map, each pixel value of the static region in the binary map is set to zero based on the deep learning target detection network. Then, for S303, the processed binary map is morphologically processed, and the dynamic region is obtained through connected region analysis.
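A sketch of S201 together with S301-S303 under stated assumptions (the median filter, the fixed threshold, the 5x5 kernel and min_area are illustrative choices, not values from the patent): difference the compensated previous frame with the current frame, denoise and binarize, zero the static region using the detector's potential dynamic boxes, then apply morphology and keep large connected components as dynamic regions:

```python
import cv2
import numpy as np

def detect_dynamic_regions(curr_gray, compensated_prev_gray, dynamic_boxes,
                           thresh=30, min_area=200):
    diff = cv2.absdiff(curr_gray, compensated_prev_gray)        # S201 frame difference
    blur = cv2.medianBlur(diff, 5)                              # S301 filtering
    _, binary = cv2.threshold(blur, thresh, 255, cv2.THRESH_BINARY)
    mask = np.zeros_like(binary)                                # S302: zero the static region
    for x, y, w, h in dynamic_boxes:
        mask[max(y, 0):y + h, max(x, 0):x + w] = 255
    binary = cv2.bitwise_and(binary, mask)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    binary = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)  # S303 morphology
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary)
    return [tuple(stats[i, :4]) for i in range(1, n)
            if stats[i, cv2.CC_STAT_AREA] >= min_area]          # dynamic region boxes
```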
According to the method, in strong-parallax scenes or when the image is blurred, the dynamic target detection result is optimized in combination with the deep learning target detection network, so that the influence of blur noise is reduced.
In a particular embodiment, the potential dynamic region is a region containing potential dynamic objects, wherein the potential dynamic objects are pedestrians or vehicles.
In some optional embodiments, the deep learning target detection network is a Yolov3 network. Therefore, the Darknet53 network and the multi-scale features are adopted for target detection, so that the method has better recognition speed and precision, and can effectively recognize common objects with motility, such as pedestrians, vehicles and the like.
It should be noted that the above method steps are not intended to limit the execution order of the steps, and in fact, some steps may be executed simultaneously or in the reverse order of the steps, which is not limited herein.
In some optional embodiments, the visual SLAM method based on dynamic target detection comprises the following steps:
(1) performing frame processing on the image to obtain each image frame;
(2) extracting feature points from a previous frame image and a current frame image;
(3) detecting potential dynamic targets through the Yolov3 network and temporarily removing them;
(4) matching the second static feature points in the previous frame image with the second static feature points of the current frame image, and calculating the first homography matrix with the RANSAC algorithm based on the matching relation; then matching the first static feature points extracted from the potential dynamic targets of the previous frame image with those of the current frame image, and optimizing the first homography matrix based on this matching relation to obtain the second homography matrix;
(5) performing motion compensation on the previous frame image through the second homography matrix, and obtaining a four-frame difference map through a four-frame difference method (differencing the adjacent four frames); specifically, the t-th frame and the (t-1)-th frame, the (t-1)-th frame and the (t-2)-th frame, and the (t-2)-th frame and the (t-3)-th frame are differenced in sequence to obtain three two-frame difference maps $D_1$, $D_2$ and $D_3$, which are then combined into the four-frame difference map $D$ (a sketch of this step is given after this list);
(6) after the four-frame difference map is obtained, further denoising the image with filtering and binarization, and, after morphological processing, determining the dynamic target with a connected region algorithm;
(7) removing the truly moving dynamic targets, and performing tracking, mapping and loop closure detection of the visual SLAM using the first static feature points of the potential dynamic regions and the second static feature points of the static regions.
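As referenced in step (5) above, a minimal sketch of the four-frame difference, assuming the three two-frame difference maps are binarized and fused by pixel-wise AND (the exact fusion operator appears only as an image in the source and is an assumption here), and assuming the frames have already been motion-compensated into a common coordinate frame:

```python
import cv2

def four_frame_difference(f_t, f_t1, f_t2, f_t3, thresh=30):
    """f_t..f_t3: grayscale frames t, t-1, t-2, t-3 (already motion-compensated)."""
    pairs = ((f_t, f_t1), (f_t1, f_t2), (f_t2, f_t3))
    bins = [cv2.threshold(cv2.absdiff(a, b), thresh, 255, cv2.THRESH_BINARY)[1]
            for a, b in pairs]
    # fuse the three two-frame difference maps D1, D2, D3 into D
    return cv2.bitwise_and(cv2.bitwise_and(bins[0], bins[1]), bins[2])
```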
As shown in fig. 4, image blurring may be caused by the motion of the camera: in fig. 4(a) the motion-compensated image is blurred. Likewise, if strong parallax arises while the camera moves, the motion compensation matrix is calculated inaccurately and the compensation result is not ideal. In such cases the binary map of the above method cannot suppress the excessive background noise, so pixels with non-zero values also appear in the static regions; this is noise produced by image blurring, and many static regions are mistakenly judged as dynamic regions, as shown in fig. 4(b). To eliminate the background noise under image blurring, the binary map is optimized in combination with the Yolov3 network by setting the pixel values of the non-potential-dynamic regions in the binary map to 0, so the final detection result becomes fig. 4(c). Comparing fig. 4(b) and fig. 4(c), the dynamic target identified by the box in fig. 4(c) is more accurate, and background false detections are also markedly reduced.
After the dynamic objects of the image are obtained according to the above process, the mapping and loop closure detection of the subsequent visual SLAM are performed with the retained feature points of the static regions.
Testing was performed using the TUM (Technische Universität München) dataset, and quantitative evaluation was obtained using the absolute trajectory error (ATE). In the TUM dataset, sequences prefixed with walking are high-dynamic sequences and sequences prefixed with sitting are low-dynamic sequences; the suffix rpy indicates that the camera rotates about the roll, pitch and yaw axes, xyz indicates that the camera translates along the x, y and z directions, halfsphere indicates that the camera additionally moves along an arc on top of rpy and xyz, and static indicates that the camera remains almost still.
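For reference, the ATE RMSE reported below is the root mean square of the translational distances between time-aligned estimated and ground-truth camera positions; a minimal sketch (the standard TUM evaluation also performs a rigid alignment step, omitted here for brevity):

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """est_xyz, gt_xyz: Nx3 arrays of corresponding camera positions."""
    err = np.linalg.norm(np.asarray(est_xyz) - np.asarray(gt_xyz), axis=1)
    return float(np.sqrt(np.mean(err ** 2)))
```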
The comparison of the present algorithm with other algorithms is shown in table 1. ORB-SLAM2 is the original algorithm without dynamic filtering; "DVO+MR" (dense visual odometry with motion removal) judges dynamic objects using a motion compensation algorithm; the map-point-weight method assigns a weight to each feature point to judge whether it is a dynamic feature point, and depends on the accuracy of the depth information; DS-SLAM judges dynamic feature points with a method combining deep learning and geometric constraints; "orbslam2+Yolov3" combines ORB-SLAM2 directly with the Yolov3 target detector and indiscriminately filters out the feature points of semantically dynamic regions.
Table 1. RMSE of the absolute trajectory error for each algorithm on the TUM sequences (the table appears only as an image in the source; its values are not recoverable).
As the comparison of the root mean square error (RMSE) of the absolute trajectory error in table 1 shows, the ORB-SLAM2 algorithm has high accuracy on the low-dynamic datasets but large errors on the high-dynamic datasets. Because the walking_rpy dataset contains partially blurred and strong-parallax images, after the dynamic feature points are filtered out the number of remaining feature points available for tracking decreases, so the algorithm's tracking fails. Combining Yolov3 reduces the effects of blurred images and strong parallax, improving robustness to a certain extent. On the walking_halfsphere dataset, the camera motion in a strong-parallax environment affects the calculation of the homography matrix and the motion compensation, so the accuracy is lower than that of DS-SLAM.
Referring to fig. 5, a block diagram of a visual SLAM device based on dynamic object detection according to the present application is shown.
As shown in fig. 5, the visual SLAM device includes a segmentation module 410, a matching module 420, a calculation module 430, an extraction module 440, an optimization module 450, and a compensation module 460.
The segmentation module 410 is configured to perform region segmentation on each image frame based on a deep learning target detection network in response to each acquired image frame, where each image frame includes a potential dynamic region and/or a static region, the potential dynamic region includes a motion feature point and/or a first static feature point, and the static region includes a second static feature point; a matching module 420 configured to match the second still feature point of the previous frame image with the second still feature point of the current frame image; a calculating module 430, configured to obtain a first homography matrix by calculating based on a RANSAC algorithm in response to the obtained matching relationship; the extracting module 440 is configured to extract a first stationary feature point of the previous frame image and a first stationary feature point of the current frame image based on a motion feature point filtering method, wherein the motion feature point filtering method is a method formed by combining a reprojection error of feature points with a maximum inter-class variance method; the optimization module 450 is configured to optimize the first homography matrix and obtain a second homography matrix based on a matching relationship between the first static feature point of the previous frame image and the first static feature point of the current frame image; and a compensation module 460 configured to perform motion compensation on the previous frame image according to the second homography matrix, so as to obtain a motion compensated frame image.
It should be understood that the modules recited in fig. 5 correspond to various steps in the methods described with reference to fig. 1, 2, and 3. Thus, the operations and features described above for the method and the corresponding technical effects are also applicable to the modules in fig. 5, and are not described again here.
In other embodiments, an embodiment of the present invention further provides a non-volatile computer storage medium, where the computer storage medium stores computer-executable instructions, where the computer-executable instructions may perform the visual SLAM method based on dynamic target detection in any of the above method embodiments;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
responding to each acquired image frame, and performing region segmentation on each image frame based on a deep learning target detection network, wherein each image frame comprises a potential dynamic region and/or a static region, the potential dynamic region comprises a motion characteristic point and/or a first static characteristic point, and the static region comprises a second static characteristic point;
matching the second static characteristic point of the previous frame image with the second static characteristic point of the current frame image;
responding to the acquired matching relation, and calculating to obtain a first homography matrix based on a RANSAC algorithm;
respectively extracting a first static characteristic point of a previous frame image and a first static characteristic point of a current frame image based on a motion characteristic point filtering method, wherein the motion characteristic point filtering method is a method formed by combining a reprojection error of the characteristic points with a maximum inter-class variance method;
optimizing the first homography matrix and obtaining a second homography matrix based on the matching relation between the first static characteristic point of the previous frame image and the first static characteristic point of the current frame image;
and performing motion compensation on the previous frame image according to the second homography matrix so as to obtain a motion compensation frame image.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from use of a visual SLAM device based on dynamic object detection, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory remotely located from the processor, and these remote memories may be connected over a network to a visual SLAM device based on dynamic object detection. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any of the above-mentioned visual SLAM methods based on dynamic object detection.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 6, the electronic device includes a processor 510 and a memory 520 (one processor 510 is taken as an example in fig. 6). The apparatus for the visual SLAM method based on dynamic object detection may further include an input device 530 and an output device 540. The processor 510, the memory 520, the input device 530, and the output device 540 may be connected by a bus or other means; connection by a bus is taken as an example in fig. 6. The memory 520 is a non-volatile computer-readable storage medium as described above. The processor 510 executes the various functional applications and data processing of the server, i.e., implements the visual SLAM method based on dynamic object detection of the above method embodiments, by running the non-volatile software programs, instructions, and modules stored in the memory 520. The input device 530 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the visual SLAM device based on dynamic object detection. The output device 540 may include a display device such as a display screen.
The product can execute the method provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.
As an embodiment, the electronic device is applied to a visual SLAM device based on dynamic object detection, and is used for a client, and includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to:
responding to each acquired image frame, and performing region segmentation on each image frame based on a deep learning target detection network, wherein each image frame comprises a potential dynamic region and/or a static region, the potential dynamic region comprises a motion characteristic point and/or a first static characteristic point, and the static region comprises a second static characteristic point;
matching the second static characteristic point of the previous frame image with the second static characteristic point of the current frame image;
responding to the acquired matching relation, and calculating to obtain a first homography matrix based on a RANSAC algorithm;
respectively extracting a first static characteristic point of a previous frame image and a first static characteristic point of a current frame image based on a motion characteristic point filtering method, wherein the motion characteristic point filtering method is a method formed by combining a reprojection error of the characteristic points with a maximum inter-class variance method;
optimizing the first homography matrix and obtaining a second homography matrix based on the matching relation between the first static characteristic point of the previous frame image and the first static characteristic point of the current frame image;
and performing motion compensation on the previous frame image according to the second homography matrix so as to obtain a motion compensation frame image.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods of the various embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A visual SLAM method based on dynamic object detection, comprising:
performing region segmentation on each image frame based on a deep learning target detection network in response to each acquired image frame, wherein each image frame comprises a potential dynamic region and a static region, the potential dynamic region comprises a motion feature point and a first static feature point, and the static region comprises a second static feature point;
matching the second static characteristic point of the previous frame image with the second static characteristic point of the current frame image;
responding to the acquired matching relation, and calculating to obtain a first homography matrix based on a RANSAC algorithm;
respectively extracting a first static feature point of the previous frame image and a first static feature point of the current frame image based on a motion feature point filtering method, wherein the motion feature point filtering method is formed by combining a reprojection error of feature points with a maximum inter-class variance method, and the specific steps of respectively extracting the first static feature point of the previous frame image and the first static feature point of the current frame image are as follows:
suppose $p_1$ and $p_2$ are a pair of feature points matched between the previous and current frames; the matched feature points and the homography matrix $H$ satisfy the relation $p_2 = H p_1$; assuming the two frames have $N$ pairs of matched feature points, the two frames have $N$ reprojection errors, and the reprojection error of one pair of matched points can be calculated as $e = \lVert p_2 - H p_1 \rVert$; the $N$ reprojection errors are divided into $k$ levels, the number of feature points at level $i$ ($1 \le i \le k$) being $n_i$, so that $\sum_{i=1}^{k} n_i = N$; let the average of the $N$ reprojection errors be $\mu$; the set of first static feature points and second static feature points is denoted $C_0$, and the set of dynamic feature points is denoted $C_1$; for a threshold level $t$, the proportion of $C_0$ is $\omega_0 = \frac{1}{N}\sum_{i=1}^{t} n_i$ and the proportion of the dynamic feature point set $C_1$ is $\omega_1 = \frac{1}{N}\sum_{i=t+1}^{k} n_i$; the mean of the set of first and second static feature points is $\mu_0 = \frac{1}{N\omega_0}\sum_{i=1}^{t} i\, n_i$ and the mean of the set of dynamic feature points is $\mu_1 = \frac{1}{N\omega_1}\sum_{i=t+1}^{k} i\, n_i$; thus the inter-class variance can be estimated as $\sigma^2 = \omega_0(\mu_0 - \mu)^2 + \omega_1(\mu_1 - \mu)^2$, which, based on $\mu = \omega_0\mu_0 + \omega_1\mu_1$, simplifies to $\sigma^2 = \omega_0\omega_1(\mu_0 - \mu_1)^2$; traversing $t$ between 0 and $k$, the residual distance that maximizes the variance $\sigma^2$ is recorded as $e^{*}$; if the reprojection error of a pair of matched points satisfies $e > e^{*}$, the feature point is a dynamic feature point, and if $e \le e^{*}$, the feature point is the first static feature point or the second static feature point;
optimizing the first homography matrix and obtaining a second homography matrix based on the matching relation between the first static characteristic point of the previous frame image and the first static characteristic point of the current frame image;
and performing motion compensation on the previous frame image according to the second homography matrix so as to obtain a motion compensation frame image.
2. The visual SLAM method based on dynamic object detection according to claim 1, wherein after motion compensation is performed on the previous frame image according to the second homography matrix to obtain the motion compensated frame image, the method further comprises:
performing difference on the motion compensation frame image and the current frame image to obtain a frame difference image;
analyzing the frame difference image subjected to denoising and morphological processing based on a connected region algorithm to determine a dynamic region, wherein the dynamic region only comprises motion characteristic points;
and removing the dynamic area from the current frame image, and performing tracking, drawing building and loop detection of the visual SLAM based on the current frame image from which the dynamic area is removed.
3. The visual SLAM method based on dynamic object detection according to claim 2, wherein analyzing, based on the connected region algorithm, the frame difference map subjected to denoising and morphological processing to determine the dynamic region comprises:
responding to the acquired frame difference image, denoising the frame difference image based on filtering and binarization processing to obtain a binary image;
setting each pixel value of a static area in the binary image to be zero based on the deep learning target detection network in response to the acquired binary image;
and carrying out morphological processing on the processed binary image, and analyzing to obtain a dynamic region based on the connected region algorithm.
4. A visual SLAM method based on dynamic object detection as claimed in claim 1, wherein said expression for motion compensation of said previous frame image according to said second homography matrix is:
$$p'_{t-1} = H_{t-1,t}\, p_{t-1}$$

where $p_{t-1}$ is a pixel point of the previous frame, $p'_{t-1}$ is the compensated pixel point of $p_{t-1}$, and $H_{t-1,t}$ is the homography matrix of the previous frame and the current frame.
5. The visual SLAM method based on dynamic object detection as claimed in claim 1 wherein the potential dynamic area is an area containing potential dynamic objects, wherein the potential dynamic objects are pedestrians or vehicles.
6. The visual SLAM method based on dynamic target detection of claim 1 wherein the deep learning target detection network is a Yolov3 network.
7. A visual SLAM apparatus based on dynamic object detection, comprising:
the segmentation module is configured to perform region segmentation on each image frame based on a deep learning target detection network in response to the acquired image frame, wherein each image frame comprises a potential dynamic region and a static region, the potential dynamic region comprises a motion feature point and a first static feature point, and the static region comprises a second static feature point;
the matching module is configured to match the second static feature point of the previous frame image with the second static feature point of the current frame image;
the calculation module is configured to respond to the acquired matching relationship and calculate to obtain a first homography matrix based on a RANSAC algorithm;
the extraction module is configured to respectively extract the first static feature point of the previous frame image and the first static feature point of the current frame image based on a motion feature point filtering method, wherein the motion feature point filtering method is formed by combining a reprojection error of feature points with a maximum inter-class variance method, and the specific steps of respectively extracting the first static feature point of the previous frame image and the first static feature point of the current frame image are as follows:
suppose $p_1$ and $p_2$ are a pair of feature points matched between the previous and current frames; the matched feature points and the homography matrix $H$ satisfy the relation $p_2 = H p_1$; assuming the two frames have $N$ pairs of matched feature points, the two frames have $N$ reprojection errors, and the reprojection error of one pair of matched points can be calculated as $e = \lVert p_2 - H p_1 \rVert$; the $N$ reprojection errors are divided into $k$ levels, the number of feature points at level $i$ ($1 \le i \le k$) being $n_i$, so that $\sum_{i=1}^{k} n_i = N$; let the average of the $N$ reprojection errors be $\mu$; the set of first static feature points and second static feature points is denoted $C_0$, and the set of dynamic feature points is denoted $C_1$; for a threshold level $t$, the proportion of $C_0$ is $\omega_0 = \frac{1}{N}\sum_{i=1}^{t} n_i$ and the proportion of the dynamic feature point set $C_1$ is $\omega_1 = \frac{1}{N}\sum_{i=t+1}^{k} n_i$; the mean of the set of first and second static feature points is $\mu_0 = \frac{1}{N\omega_0}\sum_{i=1}^{t} i\, n_i$ and the mean of the set of dynamic feature points is $\mu_1 = \frac{1}{N\omega_1}\sum_{i=t+1}^{k} i\, n_i$; thus the inter-class variance can be estimated as $\sigma^2 = \omega_0(\mu_0 - \mu)^2 + \omega_1(\mu_1 - \mu)^2$, which, based on $\mu = \omega_0\mu_0 + \omega_1\mu_1$, simplifies to $\sigma^2 = \omega_0\omega_1(\mu_0 - \mu_1)^2$; traversing $t$ between 0 and $k$, the residual distance that maximizes the variance $\sigma^2$ is recorded as $e^{*}$; if the reprojection error of a pair of matched points satisfies $e > e^{*}$, the feature point is a dynamic feature point, and if $e \le e^{*}$, the feature point is the first static feature point or the second static feature point;
the optimization module is configured to optimize the first homography matrix based on the matching relation between the first static feature point of the previous frame image and the first static feature point of the current frame image, so as to obtain a second homography matrix;
and the compensation module is configured to perform motion compensation on the previous frame image according to the second homography matrix, thereby obtaining a motion-compensated frame image (a code sketch covering these homography and warping steps follows claim 9 below).
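To make the extraction module's filtering step concrete, the following is a minimal NumPy sketch of the reprojection-error and maximum inter-class variance (Otsu) split recited above; it is an illustration, not the patented implementation. It assumes the matched points arrive as N×2 float arrays, that the homography H comes from the calculation module, and that binning the errors into k histogram levels stands in for the claim's division into k levels; all function and variable names are illustrative.

```python
import numpy as np

def reprojection_errors(pts_prev, pts_curr, H):
    """Reprojection error e_i = ||p_i' - H p_i|| for each of the N matched pairs."""
    n = pts_prev.shape[0]
    homog = np.hstack([pts_prev, np.ones((n, 1))])   # lift to homogeneous coordinates
    proj = (H @ homog.T).T
    proj = proj[:, :2] / proj[:, 2:3]                # project back to the image plane
    return np.linalg.norm(pts_curr - proj, axis=1)

def otsu_threshold(errors, k=256):
    """Return the residual distance T that maximizes the inter-class variance."""
    hist, edges = np.histogram(errors, bins=k)
    P = hist / errors.size                # level probabilities P_j, summing to 1
    levels = np.arange(1, k + 1)
    best_sigma2, best_t = -1.0, 0
    for t in range(k - 1):
        w0 = P[:t + 1].sum()              # proportion omega_0 of static class C0
        w1 = 1.0 - w0                     # proportion omega_1 of dynamic class C1
        if w0 == 0.0 or w1 == 0.0:
            continue
        mu0 = (levels[:t + 1] * P[:t + 1]).sum() / w0   # class mean mu_0
        mu1 = (levels[t + 1:] * P[t + 1:]).sum() / w1   # class mean mu_1
        sigma2 = w0 * w1 * (mu0 - mu1) ** 2             # simplified sigma^2(t)
        if sigma2 > best_sigma2:
            best_sigma2, best_t = sigma2, t
    return edges[best_t + 1]              # upper edge of the winning level as T

def split_static_dynamic(pts_prev, pts_curr, H):
    """Boolean mask over the N pairs: True where e_i <= T (static), else dynamic."""
    e = reprojection_errors(pts_prev, pts_curr, H)
    return e <= otsu_threshold(e)
```

For example, calling `split_static_dynamic(prev_pts, curr_pts, H1)` on the matches inside the potential dynamic region yields the first static feature points as the True entries of the mask. The choice k = 256 mirrors Otsu's grey-level origin and is an assumption; the claim leaves k unspecified.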
8. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any of claims 1 to 6.
9. A storage medium on which a computer program is stored, which program, when executed by a processor, carries out the method according to any one of claims 1 to 6.
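Reading the calculation, optimization, and compensation modules of claim 7 together, the claimed pipeline maps naturally onto OpenCV's homography utilities. The sketch below is one hypothetical reading, not the patent's implementation: it assumes the matched points are float32 N×2 NumPy arrays, `split_static_dynamic` is the helper from the sketch after claim 7, the 3.0-pixel RANSAC threshold is an assumption, and whether the refit uses the first static points alone or their union with the second static points is a design choice the claim leaves open.

```python
import cv2
import numpy as np

def compensate_frame(prev_img, prev_static2, curr_static2,
                     prev_potential, curr_potential):
    # Calculation module: first homography via RANSAC over static-region matches.
    H1, _ = cv2.findHomography(prev_static2, curr_static2, cv2.RANSAC, 3.0)

    # Extraction module: keep only first static points in the potential dynamic region.
    mask = split_static_dynamic(prev_potential, curr_potential, H1)

    # Optimization module: second homography refit on all retained static matches.
    src = np.vstack([prev_static2, prev_potential[mask]]).astype(np.float32)
    dst = np.vstack([curr_static2, curr_potential[mask]]).astype(np.float32)
    H2, _ = cv2.findHomography(src, dst, 0)   # method 0 = plain least squares

    # Compensation module: warp the previous frame into the current frame's viewpoint.
    h, w = prev_img.shape[:2]
    return cv2.warpPerspective(prev_img, H2, (w, h))
```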
CN202110100524.8A 2021-01-26 2021-01-26 Visual SLAM method and device based on dynamic target detection Active CN112435278B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110100524.8A CN112435278B (en) 2021-01-26 2021-01-26 Visual SLAM method and device based on dynamic target detection

Publications (2)

Publication Number Publication Date
CN112435278A (en) 2021-03-02
CN112435278B (en) 2021-05-04

Family

ID=74697251

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110100524.8A Active CN112435278B (en) 2021-01-26 2021-01-26 Visual SLAM method and device based on dynamic target detection

Country Status (1)

Country Link
CN (1) CN112435278B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113222891B (en) * 2021-04-01 2023-12-22 东方电气集团东方锅炉股份有限公司 Line laser-based binocular vision three-dimensional measurement method for rotating object
CN116452647B (en) * 2023-06-15 2023-12-08 广州安特激光技术有限公司 Dynamic image registration method, system and device based on matching pursuit

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111145251B (en) * 2018-11-02 2024-01-02 深圳市优必选科技有限公司 Robot and synchronous positioning and mapping method thereof and computer storage device
CN110533716B (en) * 2019-08-20 2022-12-02 西安电子科技大学 Semantic SLAM system and method based on 3D constraint

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107610177A (en) * 2017-09-29 2018-01-19 联想(北京)有限公司 Method and apparatus for determining feature points in simultaneous localization and mapping
US10825424B2 (en) * 2018-06-05 2020-11-03 Magic Leap, Inc. Homography transformation matrices based temperature calibration of a viewing system
CN110084850A (en) * 2019-04-04 2019-08-02 东南大学 Dynamic scene visual localization method based on image semantic segmentation
CN110378345A (en) * 2019-06-04 2019-10-25 广东工业大学 Dynamic scene SLAM method based on the YOLACT instance segmentation model
CN111156984A (en) * 2019-12-18 2020-05-15 东南大学 Monocular visual-inertial SLAM method for dynamic scenes

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"交互多模型算法在目标跟踪领域的应用";许天野等;《四川兵工学报》;20131130;第34卷(第11期);第116-119页 *
"基于高斯金字塔的视觉里程计算法研究";刘瑞等;《华东交通大学学报》;20200831;第37卷(第4期);第48-53页 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant