CN113689459A - Real-time tracking and mapping method based on GMM (Gaussian mixture model) combined with YOLO in a dynamic environment - Google Patents

Real-time tracking and mapping method based on GMM (Gaussian mixture model) combined with YOLO in a dynamic environment Download PDF

Info

Publication number
CN113689459A
CN113689459A
Authority
CN
China
Prior art keywords
dynamic
key frame
yolo
image
foreground
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110869065.XA
Other languages
Chinese (zh)
Other versions
CN113689459B (en)
Inventor
刘佳
顾淇尧
闫冬
钱昌宇
卞方舟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202110869065.XA priority Critical patent/CN113689459B/en
Publication of CN113689459A publication Critical patent/CN113689459A/en
Application granted granted Critical
Publication of CN113689459B publication Critical patent/CN113689459B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/207 Analysis of motion for motion estimation over a hierarchy of resolutions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/215 Motion-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a real-time tracking and mapping method based on GMM combined with YOLO in a dynamic environment, which comprises the following steps: (1) extracting feature points from each frame of the dynamic image sequence, and correcting the non-key frames with an affine transformation matrix; (2) training the images of the non-key-frame stage with a Gaussian mixture model (GMM), modeling the background image with the GMM and segmenting the foreground dynamic regions; (3) inputting the non-key-frame images trained in step (2) and the key-frame images from step (1) into a YOLO detector, tracking and predicting the detector's output with a particle filter algorithm, removing the dynamic feature points detected in the current frame, and inserting the key frame for map construction. The invention exploits the globally discontinuous character of key frames: the background image is trained through the GMM, the foreground dynamic region is segmented and supplies a prior to YOLO; the speed and robustness of YOLOv3 are exploited to detect dynamic targets between consecutive frames and to improve the detection accuracy of dynamic regions.

Description

Real-time tracking and mapping method based on GMM (Gaussian mixture model) combined with YOLO in a dynamic environment
Technical Field
The invention relates to the field of image processing, and in particular to a real-time tracking and mapping method based on GMM combined with YOLO in a dynamic environment.
Background
SLAM (simultaneous localization and mapping) addresses the problem of a robot moving in an unknown environment: after the robot observes the environment, its pose and motion trajectory are fed back in time and an environment map is constructed. Early SLAM systems mainly relied on sensors such as single-line lidar and sonar for self-localization; with the rapid development of computer vision, visual SLAM systems based on a camera and an IMU, thanks to their convenience and low cost, have been widely applied in fields such as robotics, AR map construction and autonomous driving.
Traditional dynamic target detection methods include the optical flow method, the inter-frame difference method and background subtraction, and they have the following problems: the optical flow method is strongly affected by changes in scene brightness, and the frame-difference method is strongly affected by noise, so false detections and missed detections occur during target detection, the target drifts during tracking, and the tracking accuracy is degraded.
Most research establishes the VSLAM system in a static environment, yet the real environment is more complex: many scenes such as classrooms, hospitals and shopping areas often contain dynamic targets such as people and cars, and many VSLAM systems lack adaptability to such complex scenes, so the computed map points and pose matrices contain errors. As the tracking-accuracy requirement rises and the number of tracked targets increases, traditional filtering algorithms can no longer provide a good tracking effect.
Disclosure of Invention
The purpose of the invention is as follows: in view of the above problems, the invention aims to provide a real-time tracking and mapping method based on GMM combined with YOLO in a dynamic environment, so as to accurately remove dynamic regions in the dynamic environment and thereby achieve stable tracking and mapping.
The technical scheme is as follows: the real-time tracking and mapping method based on the combination of GMM and YOLO in a dynamic environment of the invention comprises the following steps:
(1) extracting feature points from each frame of the dynamic image sequence, dividing the frames into key frames and non-key frames, computing the affine transformation matrix between two adjacent non-key frames, and correcting the non-key frames with the affine transformation matrix;
(2) training the images of the non-key-frame stage with a Gaussian mixture model (GMM), modeling the background image with the GMM and segmenting the foreground dynamic regions;
(3) inputting the non-key-frame images trained in step (2) and the key-frame images from step (1) into a YOLO detector, tracking and predicting the detector's output with a particle filter algorithm, removing the dynamic feature points detected in the current frame, and inserting the key frame for map construction.
Further, the affine transformation relation in step (1) is:

[x_t, y_t, 1]^T = A · [x_{t-1}, y_{t-1}, 1]^T,  A = [[a11, a12, a13], [a21, a22, a23], [0, 0, 1]]

where the vector on the left of the equation, (x_t, y_t, 1)^T, represents the current frame coordinates, the matrix A on the right represents the affine transformation matrix, and (x_{t-1}, y_{t-1}, 1)^T represents the coordinates of the frame preceding the current frame.
Further, the modeling process of step (2) includes:
(21) Each pixel of the non-key-frame-stage image is matched with the GMM: the corresponding model among the K normal distribution models is found for each pixel, and a pixel belongs to the current normal distribution model if it satisfies the following formula:

|X_t - μ_{i,t-1}| ≤ 2.5 · σ_{i,t-1}

where X_t is the pixel to be matched, i is the index of the corresponding normal distribution model, μ_{i,t-1} is the pixel mean of the i-th normal distribution model at time t-1, and σ_{i,t-1} is the standard deviation of all pixels of the i-th normal distribution model at time t-1;
(22) The weights w_{k,t} of the K normal distribution models are updated as:

w_{k,t} = (1 - α) · w_{k,t-1} + α · M_{k,t}

where α is the learning rate and w_{k,t-1} is the weight of the k-th normal distribution model at time t-1; M_{k,t} is the matching decision of the k-th model at time t: if the formula in step (21) holds, the pixel matches a Gaussian model and M_{k,t} = 1, otherwise M_{k,t} = 0;
(23) If the current pixel does not satisfy the formula in step (21), it does not belong to the background image, and the pixel mean μ_{i,t-1} and standard deviation σ_{i,t-1} are kept unchanged;
if the current pixel satisfies the formula in step (21), indicating that it belongs to the background image, the parameters of the current distribution model are updated as:

ρ = α · η(X_t | μ_k, σ_k)
μ_t = (1 - ρ) · μ_{t-1} + ρ · X_t
σ_t² = (1 - ρ) · σ_{t-1}² + ρ · (X_t - μ_t)²

where ρ is an intermediate parameter and η(X_t | μ_k, σ_k) is the learning-rate change function of the k-th model at time t;
(24) if the current pixel does not match any of the distributions in step (21), the parameters of the distribution with the smallest weight in the Gaussian mixture model are modified, and its mean is set to the current pixel value;
(25) The K distribution models are sorted in descending order of weight, the first B distributions are selected as background pixels and the remaining models as foreground pixels, where B is given by:

B = argmin_b ( Σ_{k=1}^{b} w_k > T )

where T is the proportion of the background and b is the number of selected models.
Further, after the key-frame image is sent to the YOLO detector in step (3), the key frame establishes dynamic candidate regions, each candidate region is accepted, and candidate regions that cannot be identified are discarded.
Further, after the non-key-frame images trained in step (2) are input to the YOLO detector, the foreground dynamic regions provide a prior for the detector, and the YOLO detector estimates the foreground dynamic region of the key frame from the foreground dynamic regions of the non-key frames: when a foreground dynamic region provided by the non-key frames overlaps a dynamic target detected by the YOLO detector, the current dynamic-target candidate region is accepted; if a foreground dynamic region provided by the non-key frames does not overlap any dynamic target detected by the YOLO detector, the current dynamic-target candidate region is discarded; the accepted dynamic-target candidate regions are taken as the key-frame foreground dynamic targets estimated by the YOLO detector.
Further, in step (3), the key-frame foreground dynamic targets estimated by YOLO are tracked with the particle filter algorithm, and the position and size information of the foreground dynamic target in the frame following the current frame is updated.
Further, the detection network used by the YOLO detector is YOLOv3.
Further, before the correction in the step (1), equalization processing is performed on each pixel point of the image in the non-key frame.
Further, the extraction of feature points in step (1) is realized by ORB-SLAM2.
Beneficial effects: compared with the prior art, the invention has the following notable advantages:
1. the invention exploits the globally discontinuous character of key frames, trains the background image through the GMM, segments the foreground dynamic region and provides a prior for YOLO;
2. the speed and robustness of YOLOv3 are exploited to detect dynamic targets between consecutive frames and to improve the detection accuracy of dynamic regions;
3. YOLOv3 combined with the particle filter algorithm tracks dynamic targets over long periods, effectively ensuring stable operation of target tracking.
Drawings
FIG. 1 is a schematic representation of an affine transformation matrix solution;
FIG. 2 is a diagram of GMM dynamic solution;
FIG. 3 is a GMM dynamic target detection flow diagram;
FIG. 4 is a schematic diagram of a bounding box regression;
FIG. 5 is a schematic diagram of YOLO dynamic target detection;
FIG. 6 is a schematic diagram of dynamic-region target detection by YOLO;
fig. 7 is a flow chart of dynamic target tracking based on particle filtering.
Detailed Description
The real-time tracking and mapping method based on GMM combined with YOLO in this embodiment comprises the following steps:
(1) Feature points are extracted from each frame of the dynamic image sequence by ORB-SLAM2, the frames are divided into key frames and non-key frames, and the affine transformation matrix between two adjacent non-key frames is computed, as shown in FIG. 1, where (x1, y1), (x2, y2), (x3, y3) are the corresponding mapping-point coordinates in the current frame, (x1', y1'), (x2', y2'), (x3', y3') are the corresponding mapping-point coordinates in the previous frame, and P1, P2 and P3 are three-dimensional points.
Each pixel of the images in the non-key frames is equalized, and the non-key frames are corrected with the affine transformation matrix.
The affine transformation matrix satisfies the relation:

[x_t, y_t, 1]^T = A · [x_{t-1}, y_{t-1}, 1]^T,  A = [[a11, a12, a13], [a21, a22, a23], [0, 0, 1]]

where the vector on the left of the equation, (x_t, y_t, 1)^T, represents the current frame coordinates, the matrix A on the right represents the affine transformation matrix, and (x_{t-1}, y_{t-1}, 1)^T represents the coordinates of the frame preceding the current frame.
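The affine correction of step (1) can be illustrated with a short OpenCV sketch (an assumption about the implementation, not the patented code): three matched mapping points between adjacent non-key frames give the 2×3 affine matrix, and warping the current frame with its inverse compensates the camera motion. With more than three matches, cv2.estimateAffine2D could be used instead; the equalization processing mentioned above (e.g., cv2.equalizeHist for a grayscale frame) would precede this step.

```python
# Illustrative sketch: estimate the affine transform between two adjacent
# non-key frames from three matched points and correct the current frame.
import numpy as np
import cv2

def correct_non_keyframe(prev_pts, curr_pts, curr_frame):
    """prev_pts, curr_pts: (3, 2) arrays of matched point coordinates."""
    # Solve the 2x3 affine matrix A mapping previous-frame points to the
    # current frame, i.e. [x_t, y_t, 1]^T = A * [x_{t-1}, y_{t-1}, 1]^T.
    A = cv2.getAffineTransform(np.float32(prev_pts), np.float32(curr_pts))
    # Warp the current frame with the inverse of A so that the camera motion
    # between the two non-key frames is compensated.
    h, w = curr_frame.shape[:2]
    corrected = cv2.warpAffine(curr_frame, cv2.invertAffineTransform(A), (w, h))
    return A, corrected
```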
(2) Images in the non-key frame phase are trained using GMM, as shown in FIG. 2. The background image is modeled by the GMM to further segment the foreground dynamic region, the specific process is as follows, and the flow chart is as shown in fig. 3.
(21) Each pixel of the non-key-frame-stage image is matched with the GMM: a matching model among the K normal distribution models is sought for each pixel, and a pixel belongs to the current normal distribution model if it satisfies the following formula:

|X_t - μ_{i,t-1}| ≤ 2.5 · σ_{i,t-1}

where X_t is the pixel to be matched, i is the index of the corresponding normal distribution model, μ_{i,t-1} is the pixel mean of the i-th normal distribution model at time t-1, and σ_{i,t-1} is the standard deviation of all pixels of the i-th normal distribution model at time t-1;
(22) The weights w_{k,t} of the K normal distribution models are updated as:

w_{k,t} = (1 - α) · w_{k,t-1} + α · M_{k,t}

where α is the learning rate and w_{k,t-1} is the weight of the k-th normal distribution model at time t-1; M_{k,t} is the matching decision of the k-th model at time t: if the formula in step (21) holds, the pixel belongs to one of the models of the Gaussian mixture and M_{k,t} = 1; if the formula in step (21) does not hold, the pixel does not belong to the Gaussian mixture model and M_{k,t} = 0;
(23) If the current pixel does not satisfy the formula in step (21), it does not belong to the background image, and the pixel mean μ_{i,t-1} and standard deviation σ_{i,t-1} are kept unchanged;
if the current pixel satisfies the formula in step (21), indicating that it belongs to the background image, the parameters of the current distribution model are updated as:

ρ = α · η(X_t | μ_k, σ_k)
μ_t = (1 - ρ) · μ_{t-1} + ρ · X_t
σ_t² = (1 - ρ) · σ_{t-1}² + ρ · (X_t - μ_t)²

where ρ is an intermediate parameter and η(X_t | μ_k, σ_k) is the learning-rate change function of the k-th model at time t;
(24) if the current pixel does not match any of the distributions in step (21), the parameters of the distribution with the smallest weight in the Gaussian mixture model are modified: its mean is set to the current pixel value, its standard deviation is set larger than the previous standard deviation, and its weight is set smaller than the previous weight;
(25) The K distribution models are sorted in descending order of weight, the first B distributions are selected as background pixels and the remaining models as foreground pixels, where B is given by:

B = argmin_b ( Σ_{k=1}^{b} w_k > T )

where T is the proportion of the background and b is the number of selected models.
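A single-pixel NumPy sketch of steps (21)-(25) is given below (an illustrative Stauffer-Grimson-style update, not the patented code; the default values of α and T and the re-initialization factors in step (24) are assumptions):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Gaussian density used as the learning-rate change function eta(X_t | mu_k, sigma_k)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def update_pixel_gmm(x, mu, sigma, w, alpha=0.005, T=0.7):
    """One update of a K-component per-pixel mixture for a grayscale value x.

    mu, sigma, w: length-K arrays (component means, std deviations, weights).
    Returns updated (mu, sigma, w) and True if x is classified as background.
    """
    mu, sigma, w = mu.astype(float).copy(), sigma.astype(float).copy(), w.astype(float).copy()

    # Step (21): x matches component i when |x - mu_i| <= 2.5 * sigma_i.
    matches = np.abs(x - mu) <= 2.5 * sigma
    M = matches.astype(float)

    # Step (22): w_k <- (1 - alpha) * w_k + alpha * M_k.
    w = (1.0 - alpha) * w + alpha * M

    if matches.any():
        i = int(np.argmax(matches))                 # first matching component
        # Step (23): update the matched component's mean and variance.
        rho = alpha * gaussian_pdf(x, mu[i], sigma[i])
        mu[i] = (1.0 - rho) * mu[i] + rho * x
        sigma[i] = np.sqrt((1.0 - rho) * sigma[i] ** 2 + rho * (x - mu[i]) ** 2)
    else:
        # Step (24): no match -> re-initialize the least-weighted component
        # (mean = current value, larger std-dev, smaller weight).
        i = int(np.argmin(w))
        mu[i], sigma[i], w[i] = float(x), sigma.max() * 1.5, w.min() * 0.5

    w = w / w.sum()

    # Step (25): sort by weight; the first B components whose cumulative
    # weight exceeds the background proportion T form the background model.
    order = np.argsort(-w)
    B = int(np.searchsorted(np.cumsum(w[order]), T)) + 1
    is_background = bool(matches[order[:B]].any())
    return mu, sigma, w, is_background
```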
(3) The non-key-frame images trained in step (2) and the key-frame images from step (1) are input to a YOLO detector, the detector's output is tracked and predicted with a particle filter algorithm, the dynamic feature points detected in the current frame are removed, and the key frame is inserted for map construction.
After the key-frame image is sent to the YOLO detector, the key frame establishes dynamic target candidate regions, each candidate region is accepted, and candidate regions that cannot be identified are discarded.
In step (3), the non-key-frame images trained in step (2) are input to the YOLO detector, the foreground dynamic regions provide a prior for the detector, and the YOLO detector estimates the foreground dynamic region of the key frame from the foreground dynamic regions of the non-key frames: when a foreground dynamic region provided by the non-key frames overlaps a dynamic target detected by the YOLO detector, the current dynamic-target candidate region is accepted; if it does not overlap any dynamic target detected by the YOLO detector, the current dynamic-target candidate region is discarded; the accepted dynamic-target candidate regions are taken as the key-frame foreground dynamic targets estimated by the YOLO detector.
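The acceptance rule above — a YOLO candidate is kept only when it overlaps a GMM foreground dynamic region — can be sketched as follows (boxes as (x1, y1, x2, y2); the text only requires overlap, so the default threshold of 0 is an assumption and can be raised for stricter acceptance):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def filter_dynamic_candidates(yolo_boxes, gmm_foreground_boxes, iou_thresh=0.0):
    """Keep a YOLO candidate only if it overlaps some GMM foreground region."""
    accepted = []
    for det in yolo_boxes:
        # any positive overlap accepts the candidate as a dynamic target
        if any(iou(det, fg) > iou_thresh for fg in gmm_foreground_boxes):
            accepted.append(det)
        # candidates overlapping no foreground region are discarded (treated as static)
    return accepted
```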
As shown in FIG. 5, this embodiment adopts the YOLOv3 target detection algorithm with Darknet-53 as the network backbone. The input picture is divided into a 13 × 13 grid, and each cell is used to detect dynamic targets; each cell contains bounding boxes and recognition probability values, from which it is determined whether the cell contains a dynamic target object and, if so, the position and probability information of that object. Prior boxes of 3 scales and 9 kinds are selected by dimension clustering on the bounding boxes, the bounding-box detection problem is converted into a regression problem, and for each bounding box the 4 coordinate offsets t_x, t_y, t_w, t_h are predicted, as shown in FIG. 4. The target-box result is computed from the offsets with the following formulas:

b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^{t_w}
b_h = p_h · e^{t_h}

where t_x, t_y, t_w, t_h are the offsets of the x-coordinate, y-coordinate, width and height respectively, b_x, b_y, b_w, b_h are the final target-box result, σ(·) is the Sigmoid function, c_x and c_y are the grid-cell coordinates of the current position offset relative to the top-left cell of the feature map (the Sigmoid normalization of the offsets accelerates network convergence), and p_w and p_h are the width and height of the prior box.
The prior boxes are as follows: for the 13 × 13 feature map, three prior boxes of 10 × 13, 16 × 30 and 33 × 23 pixels are used; for the 26 × 26 feature map, three prior boxes of 30 × 61, 62 × 45 and 59 × 119 pixels; and for the 52 × 52 feature map, three prior boxes of 116 × 90, 156 × 198 and 373 × 326 pixels.
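The offset-to-box decoding can be written directly from the formulas above (a minimal sketch; converting grid-cell units to pixels by multiplying by the feature-map stride is a conventional extra step not stated in the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode one predicted box from its offsets t_x, t_y, t_w, t_h.

    (cx, cy) is the grid cell that predicts the box (offset from the top-left
    cell of the feature map); (pw, ph) is the prior (anchor) box size.
    """
    bx = sigmoid(tx) + cx            # b_x = sigma(t_x) + c_x  (grid-cell units)
    by = sigmoid(ty) + cy            # b_y = sigma(t_y) + c_y
    bw = pw * np.exp(tw)             # b_w = p_w * e^{t_w}
    bh = ph * np.exp(th)             # b_h = p_h * e^{t_h}
    return bx, by, bw, bh
```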
The loss function of YOLOv3 mainly comprises the following three parts.

Target confidence loss:

L_conf = -Σ_ij [ ĉ_ij · ln(σ(c_ij)) + (1 - ĉ_ij) · ln(1 - σ(c_ij)) ]

Target classification loss:

L_cla = -Σ_ij o_ij · Σ_cls [ p̂_ij · ln(σ(p_ij)) + (1 - p̂_ij) · ln(1 - σ(p_ij)) ]

Target localization offset loss:

L_loc = Σ_ij o_ij · [ (x_ij - x̂_ij)² + (y_ij - ŷ_ij)² + (w_ij - ŵ_ij)² + (h_ij - ĥ_ij)² ]

An overall loss function Loss is established from the three loss models:

Loss = L_conf + L_cla + L_loc

where o_ij indicates whether the rectangular box is responsible for predicting a target object (it equals 1 when responsible for a target and 0 otherwise); c is the probability score that the prediction box contains the target object, and ĉ is the actual probability score that the labeled box contains the target object; p is the probability of a given class, and p̂ is the true value of the class to which the labeled box belongs; (x_ij, y_ij) are the center coordinates of the rectangular box predicted by the network, and (x̂_ij, ŷ_ij) are the center coordinates of the labeled rectangular box; (w_ij, h_ij) are the width and height of the rectangle predicted by the network, and (ŵ_ij, ĥ_ij) are the width and height of the labeled rectangular box; the hatted quantities are the true values, and c, p and the box parameters are the fitted values.
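A compact NumPy sketch of the three loss terms follows (a schematic reconstruction, not the patented code: the binary-cross-entropy form, the restriction of the classification and localization terms to positive boxes, and the equal balancing weights are assumptions):

```python
import numpy as np

def bce(y_true, y_pred, eps=1e-7):
    """Element-wise binary cross-entropy."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

def yolov3_loss(o, c_pred, cls_true, cls_pred, box_true, box_pred,
                weights=(1.0, 1.0, 1.0)):
    """o: 0/1 objectness labels per box; c_pred: predicted confidences (after Sigmoid);
    cls_true / cls_pred: class labels and predicted class probabilities of positive boxes;
    box_true / box_pred: (N, 4) arrays of (x, y, w, h) for positive boxes."""
    l_conf = bce(o, c_pred).sum()                 # target confidence loss
    l_cla = bce(cls_true, cls_pred).sum()         # target classification loss
    l_loc = ((box_pred - box_true) ** 2).sum()    # target localization offset loss
    w1, w2, w3 = weights
    return w1 * l_conf + w2 * l_cla + w3 * l_loc
```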
A schematic diagram of dynamic-region target detection with YOLO is shown in FIG. 6, where the dotted boxes represent the foreground dynamic-target candidate regions provided by the GMM and the solid boxes represent all dynamic targets detected by YOLOv3. The overlap (IOU) result is used as probability information to obtain the most likely dynamic targets; regions where the GMM dynamic detection fails are discarded, the other static targets obtained by YOLOv3 are separated out, and tracking is then performed on the solid-box dynamic targets.
The key-frame foreground dynamic targets estimated by YOLO are tracked with a particle filter algorithm, and the position and size information of the foreground dynamic target in the frame following the current frame is updated; the specific process is as follows:
First, the state equation and the observation equation are established:

x_r = f_r(x_{r-1}, v_{r-1})
y_r = h_r(x_r, n_r)

where r is the current time, x is the state quantity, v and n are noise quantities, f is the state transition function and h is the measurement function. Particle filtering exists in order to compute the maximum-confidence x_r, i.e., the maximum a posteriori p(x_r | y_{1:r}) of the Bayesian probability. Given that the probability distribution function at the previous time, p(x_{r-1} | y_{1:r-1}), is known, that the state transition obeys a first-order Markov model (a linear time-sequential relation), and that the measured data depend only on the state value, the target-recognition-based particle filter algorithm follows the steps below; the flow chart is shown in FIG. 7.
(301) Initialization: N particles are constructed, each particle is given the same weight, and the particles are uniformly distributed over the image; each particle satisfies the state equation and the observation equation. Meanwhile, in order to apply the particle filter equations to target tracking, each particle is given attribute information including the target position, target velocity, target-region box length, target-region box width and target weight. The state transition matrix A and the observation matrix C are set as a constant-velocity model:

A = [[I, I], [0, I]],  C = [I 0]

where the matrix I denotes an identity matrix (position is propagated by the velocity over one frame, and the velocity and box size are retained);
(302) Prediction: the state of the current particles is predicted from the state results of the N particles in the previous frame through the state equation:

p(x_r | y_{1:r-1}) = ∫ p(x_r, x_{r-1} | y_{1:r-1}) dx_{r-1}
                   = ∫ p(x_r | x_{r-1}, y_{1:r-1}) · p(x_{r-1} | y_{1:r-1}) dx_{r-1}
                   = ∫ p(x_r | x_{r-1}) · p(x_{r-1} | y_{1:r-1}) dx_{r-1}
(303) Correction stage: the weight of each particle is computed through the observation equation; the weight information is computed from the descriptor-matching result, the IOU value and the pixel-consistency result, and the weight results of all candidate particles are finally normalized. To avoid the situation where a large part of the VSLAM video sequence is classified as dynamic region and local mapping and pose tracking therefore fail, the observation results are limited: if the pixels of an observation region cover more than half of the video frame, the collection of dynamic results between the two key frames is omitted. Likewise, if a large number of small dynamic objects appear, the descriptor-matching time rises sharply, so an upper limit of 10 observations is set; beyond this limit, only the IOU and pixel-consistency results are used as weight information. The expected result of the particle states is computed with a Monte Carlo sampling estimate of the integral:

E(x_r) ≈ (1/N) · Σ_{i=1}^{N} x_r^(i) · p(x_r^(i) | y_{1:r}) / q(x_r^(i) | y_{1:r})

where q(x_r | y_{1:r}) is a simple probability distribution function that is introduced (the importance distribution).
(304) Resampling stage: the predicted particles are screened by their weights, and particles with large weights are retained in large numbers. Meanwhile, to avoid the particle-weight degeneracy problem, particles with low weights are discarded, and particles with large weights are duplicated in proportion to their weights to make up the discarded number; the resampled particles represent the probability distribution of the true state. The weight w_i is computed from three normalized terms: w_iou, the intersection-over-union of the prediction box and the detection box; w_f, the ratio of matches obtained by matching the target-region image cropped from the key frame against the current frame image; and w_app, the appearance-similarity result. The prediction-update process is completed by cycling through steps (302), (303), (304) and back to (302), and steps (301)-(304) operate only on the image sequence between two key frames. When the VSLAM constructs a key frame, whether a new target region needs to be constructed is judged anew; an incremental model is therefore added at key-frame construction to ensure that dynamic increments can be tracked stably during tracking, or to judge through the incremental model whether to cancel the tracking of lost target information.
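A simplified sketch of steps (301)-(304) for tracking one dynamic-target box is given below (illustrative only: the constant-velocity transition, the noise scales, the particle count and the way w_iou, w_f and w_app are combined into a single weight are assumptions; score_fn stands for the descriptor-matching, IOU and appearance measurements described above):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_particles(box, n=200):
    """Step (301): particles carry (x, y, vx, vy, w, h) and start with equal weight."""
    x, y, w, h = box
    states = np.tile([x, y, 0.0, 0.0, w, h], (n, 1))
    weights = np.full(n, 1.0 / n)
    return states, weights

def predict(states, pos_noise=5.0, size_noise=2.0):
    """Step (302): constant-velocity transition x_r = A x_{r-1} + noise."""
    states[:, 0:2] += states[:, 2:4]                       # position += velocity
    states[:, 0:2] += rng.normal(0, pos_noise, states[:, 0:2].shape)
    states[:, 4:6] += rng.normal(0, size_noise, states[:, 4:6].shape)
    return states

def correct(states, weights, score_fn):
    """Step (303): weight each particle by w_iou, w_f and w_app, then normalize.

    score_fn(state) -> (w_iou, w_f, w_app); combining the three terms as a
    simple product is an assumption of this sketch.
    """
    for i, s in enumerate(states):
        w_iou, w_f, w_app = score_fn(s)
        weights[i] = w_iou * w_f * w_app
    weights = weights + 1e-12
    return states, weights / weights.sum()

def resample(states, weights):
    """Step (304): multinomial resampling; high-weight particles are duplicated."""
    idx = rng.choice(len(weights), size=len(weights), p=weights)
    return states[idx].copy(), np.full(len(weights), 1.0 / len(weights))

def estimate(states, weights):
    """Monte Carlo estimate of the tracked box: the weighted mean state."""
    return np.average(states, axis=0, weights=weights)
```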

Claims (8)

1. A real-time tracking and mapping method based on GMM combined with YOLO in a dynamic environment, characterized by comprising the following steps:
(1) extracting feature points from each frame of the dynamic image sequence, dividing the frames into key frames and non-key frames, computing the affine transformation matrix between two adjacent non-key frames, and correcting the non-key frames with the affine transformation matrix;
(2) training the images of the non-key-frame stage with a Gaussian mixture model (GMM), modeling the background image with the GMM and segmenting the foreground dynamic regions;
(3) inputting the non-key-frame images trained in step (2) and the key-frame images from step (1) into a YOLO detector, tracking and predicting the detector's output with a particle filter algorithm, removing the dynamic feature points detected in the current frame, and inserting the key frame for map construction.
2. The real-time tracking and mapping method according to claim 1, wherein the modeling process of step (2) comprises:
(21) each pixel of the non-key-frame-stage image is matched with the GMM: a matching model among the K normal distribution models is sought for each pixel, and a pixel belongs to the current normal distribution model if it satisfies the following formula:

|X_t - μ_{i,t-1}| ≤ 2.5 · σ_{i,t-1}

where X_t is the pixel to be matched, i is the index of the corresponding normal distribution model, μ_{i,t-1} is the pixel mean of the i-th normal distribution model at time t-1, and σ_{i,t-1} is the standard deviation of all pixels of the i-th normal distribution model at time t-1;
(22) the weights w_{k,t} of the K normal distribution models are updated as:

w_{k,t} = (1 - α) · w_{k,t-1} + α · M_{k,t}

where α is the learning rate and w_{k,t-1} is the weight of the k-th normal distribution model at time t-1; M_{k,t} is the matching decision of the k-th model at time t: if the formula in step (21) holds, the pixel matches a Gaussian model and M_{k,t} = 1, otherwise M_{k,t} = 0;
(23) if the current pixel does not satisfy the formula in step (21), it does not belong to the background image, and the pixel mean μ_{i,t-1} and standard deviation σ_{i,t-1} are kept unchanged;
if the current pixel satisfies the formula in step (21), indicating that it belongs to the background image, the parameters of the current distribution model are updated as:

ρ = α · η(X_t | μ_k, σ_k)
μ_t = (1 - ρ) · μ_{t-1} + ρ · X_t
σ_t² = (1 - ρ) · σ_{t-1}² + ρ · (X_t - μ_t)²

where ρ is an intermediate parameter and η(X_t | μ_k, σ_k) is the learning-rate change function of the k-th model at time t;
(24) if the current pixel does not match any of the distributions in step (21), the parameters of the distribution with the smallest weight in the Gaussian mixture model are modified, and its mean is set to the current pixel value;
(25) the K distribution models are sorted in descending order of weight, the first B distributions are selected as background pixels and the remaining models as foreground pixels, where B is given by:

B = argmin_b ( Σ_{k=1}^{b} w_k > T )

where T is the proportion of the background and b is the number of selected models.
3. The real-time tracking and mapping method according to claim 1, wherein after the key-frame image is sent to the YOLO detector in step (3), the key frame establishes dynamic-target candidate regions, each candidate region is accepted, and candidate regions that cannot be identified are discarded.
4. The real-time tracking and mapping method according to claim 3, wherein in step (3) the non-key-frame images trained in step (2) are input to the YOLO detector, the foreground dynamic regions provide a prior for the detector, and the YOLO detector estimates the foreground dynamic region of the key frame from the foreground dynamic regions of the non-key frames; when a foreground dynamic region provided by the non-key frames overlaps a dynamic target detected by the YOLO detector, the current dynamic-target candidate region is accepted; if a foreground dynamic region provided by the non-key frames does not overlap any dynamic target detected by the YOLO detector, the current dynamic-target candidate region is discarded; and the accepted dynamic-target candidate regions are taken as the key-frame foreground dynamic targets estimated by the YOLO detector.
5. The real-time tracking and mapping method according to claim 4, wherein in step (3) the key-frame foreground dynamic targets estimated by YOLO are tracked with a particle filter algorithm, and the position and size information of the foreground dynamic target in the frame following the current frame is updated.
6. The real-time tracking and mapping method according to claim 5, wherein the detection network used by the YOLO detector is YOLOv3.
7. The real-time tracking and mapping method according to claim 1, wherein before the correction in step (1), each pixel point of the image in the non-key frame is equalized.
8. The real-time tracking and mapping method according to claim 1, wherein the extraction of feature points in step (1) is implemented by ORB-SLAM2.
CN202110869065.XA 2021-07-30 2021-07-30 Real-time tracking and mapping method based on GMM and YOLO under dynamic environment Active CN113689459B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110869065.XA CN113689459B (en) 2021-07-30 2021-07-30 Real-time tracking and mapping method based on GMM and YOLO under dynamic environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110869065.XA CN113689459B (en) 2021-07-30 2021-07-30 Real-time tracking and mapping method based on GMM and YOLO under dynamic environment

Publications (2)

Publication Number Publication Date
CN113689459A true CN113689459A (en) 2021-11-23
CN113689459B CN113689459B (en) 2023-07-18

Family

ID=78578376

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110869065.XA Active CN113689459B (en) 2021-07-30 2021-07-30 Real-time tracking and mapping method based on GMM and YOLO under dynamic environment

Country Status (1)

Country Link
CN (1) CN113689459B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114429192A (en) * 2022-04-02 2022-05-03 中国科学技术大学 Image matching method and device and electronic equipment
CN116363494A (en) * 2023-05-31 2023-06-30 睿克环境科技(中国)有限公司 Fish quantity monitoring and migration tracking method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018146558A2 (en) * 2017-02-07 2018-08-16 Mindmaze Holding Sa Systems, methods and apparatuses for stereo vision and tracking
CN110660095A (en) * 2019-09-27 2020-01-07 中国科学院自动化研究所 Visual SLAM (simultaneous localization and mapping) initialization method, system and device in dynamic environment
CN111486855A (en) * 2020-04-28 2020-08-04 武汉科技大学 Indoor two-dimensional semantic grid map construction method with object navigation points
CN112184759A (en) * 2020-09-18 2021-01-05 深圳市国鑫恒运信息安全有限公司 Moving target detection and tracking method and system based on video
CN112699769A (en) * 2020-12-25 2021-04-23 北京竞业达数码科技股份有限公司 Detection method and system for left-over articles in security monitoring

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018146558A2 (en) * 2017-02-07 2018-08-16 Mindmaze Holding Sa Systems, methods and apparatuses for stereo vision and tracking
CN110660095A (en) * 2019-09-27 2020-01-07 中国科学院自动化研究所 Visual SLAM (simultaneous localization and mapping) initialization method, system and device in dynamic environment
CN111486855A (en) * 2020-04-28 2020-08-04 武汉科技大学 Indoor two-dimensional semantic grid map construction method with object navigation points
CN112184759A (en) * 2020-09-18 2021-01-05 深圳市国鑫恒运信息安全有限公司 Moving target detection and tracking method and system based on video
CN112699769A (en) * 2020-12-25 2021-04-23 北京竞业达数码科技股份有限公司 Detection method and system for left-over articles in security monitoring

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHANDAN G et al.: "Real time object detection and tracking using Deep Learning and OpenCV", 2018 International Conference on Inventive Research in Computing Applications (ICIRCA), pages 1305-1308
JIA LIU et al.: "VSLAM method based on object detection in dynamic environments", Frontiers in Neurorobotics, pages 1-16
李寰宇 et al.: "An easily initialized convolutional-neural-network-like visual tracking algorithm" (一种易于初始化的类卷积神经网络视觉跟踪算法), Journal of Electronics & Information Technology (电子与信息学报), vol. 38, no. 1, pages 1-7
门玉森: "Research on imitation learning based on trajectory matching for humanoid robot motion behavior" (基于轨迹匹配的模仿学习在类人机器人运动行为中的研究), China Masters' Theses Full-text Database, Information Science and Technology, vol. 3, pages 140-1055

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114429192A (en) * 2022-04-02 2022-05-03 中国科学技术大学 Image matching method and device and electronic equipment
CN114429192B (en) * 2022-04-02 2022-07-15 中国科学技术大学 Image matching method and device and electronic equipment
CN116363494A (en) * 2023-05-31 2023-06-30 睿克环境科技(中国)有限公司 Fish quantity monitoring and migration tracking method and system

Also Published As

Publication number Publication date
CN113689459B (en) 2023-07-18

Similar Documents

Publication Publication Date Title
CN111563442B (en) Slam method and system for fusing point cloud and camera image data based on laser radar
CN109387204B (en) Mobile robot synchronous positioning and composition method facing indoor dynamic environment
CN108596974B (en) Dynamic scene robot positioning and mapping system and method
Clark et al. Vidloc: A deep spatio-temporal model for 6-dof video-clip relocalization
CN112132893B (en) Visual SLAM method suitable for indoor dynamic environment
CN106055091B (en) A kind of hand gestures estimation method based on depth information and correcting mode
CN106875425A (en) A kind of multi-target tracking system and implementation method based on deep learning
CN113674328A (en) Multi-target vehicle tracking method
Teulière et al. Using multiple hypothesis in model-based tracking
CN105809716B (en) Foreground extraction method integrating superpixel and three-dimensional self-organizing background subtraction method
CN113689459A (en) GMM (Gaussian mixture model) combined with YOLO (YOLO) based real-time tracking and graph building method in dynamic environment
CN113483747A (en) Improved AMCL (advanced metering library) positioning method based on semantic map with corner information and robot
CN111797688A (en) Visual SLAM method based on optical flow and semantic segmentation
CN105760898A (en) Vision mapping method based on mixed group regression method
CN113362341B (en) Air-ground infrared target tracking data set labeling method based on super-pixel structure constraint
CN115035260A (en) Indoor mobile robot three-dimensional semantic map construction method
CN114782499A (en) Image static area extraction method and device based on optical flow and view geometric constraint
CN111340881A (en) Direct method visual positioning method based on semantic segmentation in dynamic scene
CN114708293A (en) Robot motion estimation method based on deep learning point-line feature and IMU tight coupling
CN115063447A (en) Target animal motion tracking method based on video sequence and related equipment
CN112991534A (en) Indoor semantic map construction method and system based on multi-granularity object model
CN110553650B (en) Mobile robot repositioning method based on small sample learning
CN110163132A (en) A kind of correlation filtering tracking based on maximum response change rate more new strategy
Wan et al. Automatic moving object segmentation for freely moving cameras
Taj et al. Multi-view multi-object detection and tracking

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant