CN109345559B - Moving target tracking method based on sample expansion and depth classification network - Google Patents

Moving target tracking method based on sample expansion and depth classification network

Info

Publication number
CN109345559B
Authority
CN
China
Prior art keywords
target
rectangular
tracked
frame
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811005680.0A
Other languages
Chinese (zh)
Other versions
CN109345559A (en)
Inventor
田小林
荀亮
李芳
李帅
焦李成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201811005680.0A priority Critical patent/CN109345559B/en
Publication of CN109345559A publication Critical patent/CN109345559A/en
Application granted granted Critical
Publication of CN109345559B publication Critical patent/CN109345559B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a moving target tracking method based on sample expansion and a deep classification network. The steps of the invention are: (1) build a deep classification network model; (2) generate a positive sample set; (3) generate a negative sample set; (4) train the deep classification network model; (5) extract a target feature network model; (6) predict the target position in the next frame image; (7) judge whether the current frame image is the last frame of the video sequence to be tracked; if so, execute step (8), otherwise execute step (6); (8) finish tracking the moving target to be tracked. The invention trains the deep classification network with the expanded sample sets and determines the position of the target to be tracked from the feature response values, thereby solving the problem of inaccurate tracking caused by appearance deformation and occlusion of the target.

Description

Moving target tracking method based on sample expansion and depth classification network
Technical Field
The invention belongs to the technical field of image processing, and further relates to a moving target tracking method based on sample expansion and a deep classification network in the field of computer vision. The method can be used for tracking moving targets in video sequences acquired in complex scenes such as video surveillance, robot navigation, and unmanned aerial vehicles.
Background
The main task of moving object tracking is to detect a moving object in a continuous sequence of video images and then determine its position in each frame. With the deepening understanding of computer vision, moving target tracking has been widely applied and developed in this field, and deep learning methods are gradually being applied to target tracking. Compared with the hand-crafted feature extraction of traditional tracking methods, which relies heavily on the designer's prior knowledge, deep learning can exploit big data: given enough training data, a neural network learns features automatically, and the features obtained in this way are far better than hand-crafted ones. However, when deep learning is applied to target tracking, the main problem is the lack of training data: one of the advantages of deep models comes from efficient learning over large amounts of annotated training data, while target tracking only provides the bounding box of the first frame as training data.
The patent document "A feature extraction and target tracking method based on a convolutional neural network" (patent application No. 201711262806.8, publication No. 105678338A) filed by Sun Yat-sen University discloses a method for tracking a moving target with a deep convolutional network. The specific steps of the method are: (1) construct and pre-train a network model; (2) input the first frame of the video image into the reconstructed network for repeated iteration according to the video sequence, and train the network model online; (3) input the video sequence and compute the tracking result; (4) evaluate the tracking result of the last frame in the video sequence, and select positive sample results to feed back into the network for iteration so as to update the network parameters. The disadvantage of this method is that, when the pre-trained network model is trained online on the first frame of the video, the image is fed into the reconstructed network for repeated iteration, so overfitting occurs easily; when the target deforms to a large extent in subsequent video frames, drift occurs easily and long-term accurate tracking cannot be achieved.
A method of tracking a moving target with local feature learning is disclosed in the patent document "Target tracking method based on local feature learning" (patent application No. 201610024953.0, publication No. 108038435A) filed by South China Agricultural University. The specific steps of the method are: (1) decompose the target area and the background area into a large number of local area units, and train and construct an appearance model by deep learning; (2) compute the confidence that each local area of the next frame image belongs to the target, obtaining a confidence map for locating the target; (3) set thresholds T_pos and T_neg, add local areas whose confidence is greater than T_pos to the target sample set, add local areas whose confidence is less than T_neg to the background sample set, and update the appearance model. The disadvantage of this method is that, because the sample type of each local area of the image is decided by thresholds, when the target to be tracked is occluded to a large extent, target samples or background samples may be wrongly assigned, so the updated model cannot keep tracking the target accurately.
Disclosure of Invention
The object of the invention is to provide, in view of the shortcomings of the prior art, a moving target tracking method based on sample expansion and a deep classification network, so that a target can be tracked accurately and effectively when it deforms, changes scale, or is occluded.
The idea for achieving this object is as follows. First, to address the problem of insufficient training samples, a sample expansion method is used to generate positive and negative sample sets containing a large number of images. Second, the deep residual network ResNet50 is extended to obtain a deep classification network model, from which a target feature network model is further extracted. Finally, the image sequence cropped from the candidate region is input into the target feature network model, and the specific position of the target to be tracked is obtained from the feature response values.
The method comprises the following specific steps:
(1) constructing a deep classification network model:
(1a) building a 3-layer fully-connected network, wherein the first layer of the network is an input layer, the second layer is a hidden layer, and the third layer is an output layer;
(1b) the parameters of each layer in the fully connected network are set as follows: the number of neurons in the first layer is set to 1024 and its activation function is set to the rectified linear unit (ReLU) function; the number of neurons in the second layer is set to 2; the number of neurons in the third layer is set to 2 and its activation function is set to the Sigmoid function;
(1c) taking the output of the deep residual network ResNet50 as the input of the fully connected network to obtain a deep classification network model;
(2) generating a positive sample set:
(2a) inputting the first frame image of a video image sequence containing the target to be tracked, and determining a rectangular frame whose center is the center of the initial position of the target to be tracked and whose length and width are the length and width of the target to be tracked;
(2b) cropping 3000 rectangular target images of the same size from the rectangular frame to form a positive sample set;
(2c) randomly selecting one rectangular target image from the positive sample set, evenly dividing it into 3 equal parts along each side to obtain 3×3 small rectangular images (discarding any remainder smaller than an equal part), and randomly recombining and splicing the 9 small rectangular images into 4000 recombined images of the same size as the rectangular target image to form a recombined image set;
(2d) in a first frame of video image, acquiring a scale change image set in a single-pixel stepping mode;
(2e) adding the recombined image set and the scale change image set into a positive sample set to form an expanded positive sample set;
(3) generating a negative sample set:
(3a) in the first frame image of the video, determining 5 large rectangular frames centered on the initial position of the target to be tracked, the length and width of each large rectangular frame being 1.5, 1.6, 1.7, 1.8 and 1.9 times the length and width of the target to be tracked, respectively;
(3b) selecting a rectangular sliding frame with the same size as the target to be tracked;
(3c) sliding the rectangular sliding frame in each large rectangular frame, intercepting images in the rectangular sliding frame after each translation, and forming an expanded negative sample set by all the intercepted images;
(4) training a deep classification network model:
(4a) inputting the extended positive sample set and the extended negative sample set into a deep classification network model;
(4b) updating the weight of each node in the deep classification network model with the stochastic gradient descent method to obtain a trained deep classification network model;
(5) extracting a target feature network model:
deleting the output layer of the fully connected network in the trained deep classification network model, taking the hidden layer of the fully connected network as the output layer of the target feature network, and extracting the target feature network model;
(6) predicting the target position of the current frame image:
(6a) loading the next frame image in the video sequence to be tracked as the current frame image; taking the position of the target to be tracked in the previously loaded frame image as the center, establishing a rectangular area in the current frame image with 1.5 times the length and width of the target to be tracked, and taking this rectangular area as the candidate region of the target to be tracked;
(6b) intercepting rectangular images in a candidate region of a target to be tracked in a sliding mode with step change, and forming a candidate image sequence by all the intercepted rectangular images;
(6c) inputting the candidate image sequence into the target feature network model, obtaining the feature response value sequence corresponding to the candidate images, and selecting the maximum feature response value from the feature response value sequence;
(6d) in the current frame image, taking the position of the candidate image corresponding to the maximum feature response value as the position of the tracked target;
(7) judging whether the current frame video image is the last frame video image of the video image sequence to be tracked, if so, executing the step (8), otherwise, executing the step (6);
(8) and finishing the tracking of the moving target to be tracked.
Compared with the prior art, the invention has the following advantages:
firstly, because the invention generates expanded positive and negative sample sets, it overcomes the problems in the prior art that training the network by repeatedly iterating on the first frame image easily causes overfitting and that tracking becomes inaccurate when the target to be tracked deforms to a large extent; the invention can therefore track the target more accurately when the target to be tracked deforms to a large extent.
Secondly, the invention constructs a target feature network model and judges the position of the target to be tracked from the target feature response values, thereby overcoming the problem in the prior art that positive and negative samples are easily misassigned when the target to be tracked is occluded to a large extent, so that the updated model cannot keep tracking the target accurately; the invention can therefore track the target more accurately when the target to be tracked is occluded to a large extent.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a simulation of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The specific steps of the present invention are further described with reference to fig. 1.
Step 1, constructing a deep classification network model.
Construct a 3-layer fully connected network, wherein the first layer of the network is an input layer, the second layer is a hidden layer, and the third layer is an output layer.
The parameters of each layer in the fully connected network are set as follows: the number of neurons in the first layer is set to 1024 and its activation function is set to the rectified linear unit (ReLU) function; the number of neurons in the second layer is set to 2; the number of neurons in the third layer is set to 2 and its activation function is set to the Sigmoid function.
Take the output of the deep residual network ResNet50 as the input of the fully connected network to obtain the deep classification network model.
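As a purely illustrative sketch (not part of the claimed method), the construction above can be expressed in PyTorch; the use of PyTorch/torchvision, the 2048-dimensional pooled ResNet50 output, and the class name DeepClassificationNet are assumptions of this sketch rather than details taken from the patent:

```python
import torch
import torch.nn as nn
from torchvision import models

class DeepClassificationNet(nn.Module):
    """ResNet50 backbone followed by the 3-layer fully connected network of step 1
    (layer sizes and activations taken from the text; the framework is an assumption)."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=None)   # deep residual network ResNet50
        backbone.fc = nn.Identity()                # keep the 2048-d pooled features
        self.backbone = backbone
        self.fc = nn.Sequential(
            nn.Linear(2048, 1024), nn.ReLU(),      # first layer: 1024 neurons, ReLU
            nn.Linear(1024, 2),                    # second (hidden) layer: 2 neurons
            nn.Linear(2, 2), nn.Sigmoid(),         # third (output) layer: 2 neurons, Sigmoid
        )

    def forward(self, x):
        return self.fc(self.backbone(x))
```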
Step 2, generating a positive sample set.
Input the first frame image of a video image sequence containing the target to be tracked, and determine a rectangular frame whose center is the center of the initial position of the target to be tracked and whose length and width are the length and width of the target to be tracked.
Crop 3000 rectangular target images of the same size from the rectangular frame to form a positive sample set.
Randomly select one rectangular target image from the positive sample set, evenly divide it into 3 equal parts along each side to obtain 3×3 small rectangular images (discarding any remainder smaller than an equal part), and randomly recombine and splice the 9 small rectangular images into 4000 recombined images of the same size as the rectangular target image, forming a recombined image set.
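A minimal sketch of this split-and-recombine augmentation is given below; the use of NumPy and a plain random permutation of the 9 tiles as the "random combination and splicing" are assumptions of this illustration:

```python
import numpy as np

def recombine_images(target_img, num_images=4000, grid=3, rng=None):
    """Split a rectangular target image into grid x grid tiles (discarding any remainder
    that does not fill a full tile) and splice randomly permuted tiles into new images."""
    rng = rng or np.random.default_rng()
    h, w = target_img.shape[:2]
    th, tw = h // grid, w // grid                       # tile size; remainder is discarded
    tiles = [target_img[r*th:(r+1)*th, c*tw:(c+1)*tw]
             for r in range(grid) for c in range(grid)]
    recombined = []
    for _ in range(num_images):
        order = rng.permutation(len(tiles))             # random arrangement of the 9 tiles
        rows = [np.concatenate([tiles[order[r*grid + c]] for c in range(grid)], axis=1)
                for r in range(grid)]
        recombined.append(np.concatenate(rows, axis=0))
    return recombined
```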
In a first frame image of the video, a scale change image set is obtained in a single-pixel stepping mode.
The specific steps of the single-pixel stepping mode are as follows:
step 1, forming a small rectangular frame by the center of the initial position of the target to be tracked and the length and the width which are 0.5 times of the initial position.
And step 2, keeping the center of the small rectangular frame unchanged, increasing the width of the small rectangular frame by 1 pixel, and taking the product of the aspect ratio of the small rectangular frame and the increased width as the length to form a temporary rectangular frame.
And 3, judging whether the width of the temporary rectangular frame is more than or equal to 3 times of the width of the small rectangular frame, if so, executing the fourth step, otherwise, intercepting and storing the rectangular image in the temporary rectangular frame and then executing the second step.
And 4, forming all the intercepted images into a scale change image set.
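The single-pixel stepping procedure above could be sketched as follows; the coordinate convention, the simplified boundary handling and the helper name scale_change_set are assumptions, while the 0.5x starting size, 1-pixel width increment, fixed aspect ratio and 3x stopping condition follow the text:

```python
def scale_change_set(frame, cx, cy, target_w, target_h):
    """Build the scale change image set: grow a frame around the target center one
    pixel of width at a time, keeping the aspect ratio of the initial small frame."""
    aspect = target_h / target_w            # length/width ratio (shared by the small frame)
    small_w = 0.5 * target_w                # small frame: 0.5 times the target size
    w, crops = small_w, []
    while True:
        w += 1                              # step 2: increase the width by 1 pixel
        h = aspect * w                      # length = aspect ratio x increased width
        if w >= 3 * small_w:                # step 3: stop once the width reaches 3x the small frame
            break
        x0, y0 = int(cx - w / 2), int(cy - h / 2)
        crops.append(frame[max(0, y0):int(y0 + h), max(0, x0):int(x0 + w)])
    return crops
```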
And adding the recombined image set and the scale change image set into the positive sample set to form an expanded positive sample set.
Step 3, generating a negative sample set.
In the first frame image of the video, determine 5 large rectangular frames centered on the initial position of the target to be tracked, the length and width of each large rectangular frame being 1.5, 1.6, 1.7, 1.8 and 1.9 times the length and width of the target to be tracked, respectively.
Select a rectangular sliding frame of the same size as the target to be tracked.
Slide the rectangular sliding frame within each large rectangular frame, crop the image inside the rectangular sliding frame after each translation, and form an expanded negative sample set from all the cropped images.
The rectangular sliding frame slides within each large rectangular frame as follows: starting from the upper left corner of each large rectangular frame, with a step of 1 pixel, translate the rectangular sliding frame in turn to the upper right corner, the lower right corner and the lower left corner of the large rectangular frame, and finally translate it back to the upper left corner.
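The border-following sweep just described can be sketched as below; the corner-to-corner path with a 1-pixel step and the 1.5-1.9x scales follow the text, while the integer rounding, the simplified boundary handling and the helper name are assumptions of this illustration:

```python
def negative_sample_set(frame, cx, cy, tw, th, scales=(1.5, 1.6, 1.7, 1.8, 1.9)):
    """Crop target-sized boxes while the sliding frame follows the border of each
    enlarged rectangle centered on the target's initial position (1-pixel steps)."""
    tw, th = int(tw), int(th)
    negatives = []
    for s in scales:
        bw, bh = int(s * tw), int(s * th)             # large rectangle, s times the target size
        x0, y0 = int(cx - bw / 2), int(cy - bh / 2)   # its upper-left corner
        # upper-left -> upper-right -> lower-right -> lower-left -> back to upper-left
        path = ([(x0 + d, y0) for d in range(bw)] +
                [(x0 + bw, y0 + d) for d in range(bh)] +
                [(x0 + bw - d, y0 + bh) for d in range(bw)] +
                [(x0, y0 + bh - d) for d in range(bh)])
        for x, y in path:
            if x < 0 or y < 0:                        # simplified boundary handling
                continue
            crop = frame[y:y + th, x:x + tw]
            if crop.shape[0] == th and crop.shape[1] == tw:
                negatives.append(crop)
    return negatives
```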
Step 4, training the deep classification network model.
The augmented positive sample set and the augmented negative sample set are input to a deep classification network model.
Update the weight of each node in the deep classification network model with the stochastic gradient descent method to obtain the trained deep classification network model.
The specific steps of the stochastic gradient descent method are as follows:
and step 1, randomly selecting a number in the range of (0,0.1), and using the number as an initial weight of each node in the deep classification network model.
And 2, taking the initial weight of each node as the current weight of each node in the deep classification network model in the first iteration process.
Step 3, randomly selecting 2 from the positive and negative sample setsnThe sample images are propagated in the depth classification network model in the forward direction, wherein n is more than or equal to 3 and less than or equal to 7, and the output layer of the depth classification network model outputs 2nAnd (5) classifying the sample images.
Step 4: calculate the average logarithmic loss value of the classification results according to the following formula:
L = -(1/N) Σ_{i=1}^{N} [ y_i·log(p_i) + (1 - y_i)·log(1 - p_i) ]
where L represents the average logarithmic loss value of the classification results, N represents the total number of randomly selected sample images, Σ represents the summation operation, i represents the index of an input sample image, y_i represents the class of the ith input sample image (y_i takes the value 1 for a positive-class sample and 0 for a negative-class sample), log denotes the base-10 logarithm operation, and p_i represents the output value of the deep classification network model for the ith sample image in the classification results.
Step 5: take the partial derivative of the average logarithmic loss value with respect to the current weight of each node in the deep classification network to obtain the gradient value of the current weight of each node in the deep classification network model.
Step 6: calculate the updated weight of each node in the deep classification network model according to the following formula:
ŵ_k = w_k - α·Δw_k
where ŵ_k represents the updated weight of the kth node of the deep classification network model, w_k represents the current weight of the kth node of the deep classification network model, α represents the learning rate, whose value range is (0, 1), and Δw_k represents the gradient value of the current weight of the kth node in the deep classification network model.
Step 7: judge whether all sample images in the training sample set have been selected; if so, the trained deep classification network model is obtained, otherwise take the updated weight of each node as the current weight and execute step 3.
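Putting the stochastic gradient descent steps above together, a training loop might look like the following sketch (assuming the hypothetical DeepClassificationNet from the earlier sketch). The (0, 0.1) uniform initialization, mini-batches of 2^n images, the base-10 average log loss and the plain w_k - α·Δw_k update follow the text, while using the first Sigmoid output as the positive-class probability and initializing every parameter (including the ResNet50 backbone) are assumptions:

```python
import torch

def train_deep_classification_net(model, samples, labels, n=5, lr=0.01, epochs=1):
    """Train the deep classification network with plain stochastic gradient descent.
    `samples` is a tensor of images, `labels` holds 1 for positive and 0 for negative
    samples; the mini-batch size is 2**n with 3 <= n <= 7 and lr lies in (0, 1)."""
    with torch.no_grad():
        for p in model.parameters():              # taken literally from the text: every node
            p.uniform_(0.0, 0.1)                   # weight starts as a random number in (0, 0.1)
    batch = 2 ** n
    for _ in range(epochs):
        perm = torch.randperm(len(samples))        # randomly select sample images
        for start in range(0, len(samples), batch):
            idx = perm[start:start + batch]
            x, y = samples[idx], labels[idx].float()
            p = model(x)[:, 0].clamp(1e-7, 1 - 1e-7)   # assumed positive-class Sigmoid output
            # average logarithmic (base-10) loss over the mini-batch
            loss = -(y * torch.log10(p) + (1 - y) * torch.log10(1 - p)).mean()
            model.zero_grad()
            loss.backward()
            with torch.no_grad():
                for w in model.parameters():
                    if w.grad is not None:
                        w -= lr * w.grad           # w_k <- w_k - alpha * delta_w_k
    return model
```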
Step 5, extracting the target feature network model.
Delete the output layer of the fully connected network in the trained deep classification network model, take the hidden layer of the fully connected network as the output layer of the target feature network, and extract the target feature network model.
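Under the same assumptions as the earlier sketches, extracting the target feature network amounts to dropping the last Linear layer and the Sigmoid of the fully connected head so that the 2-neuron hidden layer becomes the output:

```python
import torch.nn as nn

def extract_target_feature_net(trained_model):
    """Delete the output layer of the fully connected network and keep its hidden
    layer (2 neurons) as the output of the target feature network."""
    trained_model.fc = nn.Sequential(*list(trained_model.fc.children())[:-2])  # drop Linear(2, 2) + Sigmoid
    return trained_model
```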
Step 6, predicting the target position of the current frame image.
Load the next frame image in the video sequence to be tracked as the current frame image; taking the position of the target to be tracked in the previously loaded frame image as the center, establish a rectangular area in the current frame image with 1.5 times the length and width of the target to be tracked, and take this rectangular area as the candidate region of the target to be tracked.
Crop rectangular images from the candidate region of the target to be tracked in a variable-step sliding mode, and form a candidate image sequence from all the cropped rectangular images.
The specific steps of the variable-step sliding mode are as follows:
Step 1: select a rectangular sliding frame of the same size as the target to be tracked, and set the maximum and minimum sliding steps in the x-axis and y-axis directions respectively.
Step 2: place the rectangular sliding frame at the upper left corner of the candidate region of the target to be tracked.
Step 3: calculate the sliding step in the positive x-axis direction according to the following formula:
(Equation image in the original: S_x expressed in terms of S_x1, S_x2, w, u′ and u.)
where S_x represents the sliding step in the positive x-axis direction, S_x1 represents the maximum sliding step in the x-axis direction, S_x2 represents the minimum sliding step in the x-axis direction, w represents the width of the target to be tracked, u′ represents the abscissa of the center point of the rectangular sliding frame, and u represents the abscissa of the center point of the candidate region of the target to be tracked.
Step 4: slide the rectangular sliding frame by the sliding step in the positive x-axis direction, and crop the framed image.
Step 5: judge whether the rectangular sliding frame exceeds the candidate region of the target to be tracked; if so, translate the rectangular sliding frame along the negative x-axis direction to the leftmost side of the candidate region of the target to be tracked and then execute step 6, otherwise execute step 3.
Step 6: calculate the sliding step in the positive y-axis direction according to the following formula:
(Equation image in the original: S_y expressed in terms of S_y1, S_y2, h, v′ and v.)
where S_y represents the sliding step in the positive y-axis direction, S_y1 represents the maximum sliding step in the y-axis direction, S_y2 represents the minimum sliding step in the y-axis direction, h represents the length of the target to be tracked, v′ represents the ordinate of the center point of the current position of the rectangular sliding frame, and v represents the ordinate of the center point of the candidate region of the target to be tracked.
Step 7: slide the rectangular sliding frame by the sliding step in the positive y-axis direction, and crop the framed image.
Step 8: judge whether the rectangular sliding frame exceeds the candidate region of the target to be tracked; if so, execute step 9, otherwise execute step 3.
Step 9: form all the cropped images into a candidate image sequence.
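A sketch of the variable-step sliding search is shown below. The exact step formulas are given only as images in the patent, so a simple linear interpolation between the minimum and maximum step, growing with the distance of the sliding-frame center from the candidate-region center, is used here purely as a stand-in assumption:

```python
def candidate_image_sequence(frame, region, tw, th, sx_max, sx_min, sy_max, sy_min):
    """Slide a target-sized frame over the candidate region with a variable step and
    collect the cropped candidate images. The step shrinks linearly as the frame
    approaches the region center (a stand-in for the patent's step formulas)."""
    rx, ry, rw, rh = region                      # candidate region: top-left corner, width, height
    u, v = rx + rw / 2.0, ry + rh / 2.0          # center of the candidate region
    candidates, y = [], float(ry)
    while y + th <= ry + rh:
        x = float(rx)
        while x + tw <= rx + rw:
            candidates.append(frame[int(y):int(y) + th, int(x):int(x) + tw])
            u_prime = x + tw / 2.0               # abscissa of the sliding-frame center
            frac = min(abs(u_prime - u) / (tw / 2.0), 1.0)
            x += max(1.0, sx_min + (sx_max - sx_min) * frac)   # larger step farther from center
        v_prime = y + th / 2.0                   # ordinate of the sliding-frame center
        frac = min(abs(v_prime - v) / (th / 2.0), 1.0)
        y += max(1.0, sy_min + (sy_max - sy_min) * frac)
    return candidates
```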
Input the candidate image sequence into the target feature network model, obtain the feature response value sequence corresponding to the candidate images, and select the maximum feature response value from the feature response value sequence.
In the current frame image, take the position of the candidate image corresponding to the maximum feature response value as the position of the tracked target.
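These two operations reduce to a forward pass through the target feature network followed by an argmax over the response values. In the sketch below, taking the first component of the 2-neuron output as the scalar feature response value, and requiring the candidates to be preprocessed image tensors of identical shape, are assumptions:

```python
import torch

def locate_target(feature_net, candidates, positions):
    """Feed every candidate image through the target feature network and return the
    position whose candidate produced the largest feature response value."""
    feature_net.eval()
    with torch.no_grad():
        batch = torch.stack(candidates)          # candidate image sequence as one batch
        responses = feature_net(batch)[:, 0]     # assumed scalar response per candidate
        best = int(torch.argmax(responses))
    return positions[best]                       # position of the max-response candidate
```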
Step 7, judging whether the current frame video image is the last frame of the video image sequence to be tracked; if so, executing step 8, otherwise executing step 6.
Step 8, finishing tracking of the moving target to be tracked.
The effect of the present invention is further explained below with a simulation experiment.
1. Simulation experiment conditions are as follows:
the hardware test platform of the simulation experiment of the invention is as follows: the CPU is intel Core i5-6500, the main frequency is 3.2GHz, the memory is 8GB, and the GPU is NVIDIA TITAN Xp; the software platform is as follows: ubuntu 16.04 LTS, 64-bit operating system, python 3.6.5.
2. Simulation content and result analysis:
the simulation experiment of the present invention is a simulation experiment performed on a video image sequence of a man walking on a road from an Object tracking benchmark 2015 database by using the method of the present invention, wherein the video image sequence has a total of 252 frames of video images, and the result of the simulation experiment of the present invention is shown in fig. 2.
Fig. 2(a) is the first frame image of the video image sequence input in the simulation experiment of the present invention, and the solid-line rectangular frame in fig. 2(a) marks the initial position of the target to be tracked.
Fig. 2(b) is a schematic diagram of the tracking result for a video frame in which the input target to be tracked is occluded in the simulation experiment of the present invention. The man in the video image is the target to be tracked, and he is occluded. A candidate region of the target to be tracked is determined in this video image, the candidate image sequence cropped from the candidate region is input into the target feature network to obtain the feature response value sequence corresponding to the candidate images, and the position of the candidate image with the maximum feature response value in the current frame image is taken as the position of the target to be tracked. The solid-line rectangular frame in fig. 2(b) indicates the position of the target to be tracked.
Fig. 2(c) is a schematic diagram of the tracking result for a video frame in which the input target to be tracked is deformed in the simulation experiment of the present invention. The man in the video image is the target to be tracked, and he undergoes deformation. A candidate region of the target to be tracked is determined in this video image, the candidate image sequence cropped from the candidate region is input into the target feature network to obtain the feature response value sequence corresponding to the candidate images, and the position of the candidate image with the maximum feature response value in the current frame image is taken as the position of the target to be tracked. The solid-line rectangular frame in fig. 2(c) indicates the position of the target to be tracked.
As can be seen from figs. 2(b) and 2(c), the target framed by the solid-line rectangular frame is consistent with the target framed by the solid-line rectangular frame in fig. 2(a), which shows that the method can track the target accurately and effectively when the target to be tracked in the video image is deformed or occluded.

Claims (5)

1. A moving target tracking method based on sample expansion and a deep classification network is characterized in that a deep classification network model is constructed, positive and negative sample sets are generated, the deep classification network model is trained, and a target characteristic network model is extracted; the method comprises the following specific steps:
(1) constructing a deep classification network model:
(1a) building a 3-layer fully-connected network, wherein the first layer of the network is an input layer, the second layer is a hidden layer, and the third layer is an output layer;
(1b) the parameters of each layer in the fully connected network are set as follows: setting the number of neurons in the first layer to 1024 and the activation function to the rectified linear unit (ReLU) function; setting the number of neurons in the second layer to 2; setting the number of neurons in the third layer to 2 and the activation function to the Sigmoid function;
(1c) taking the output of the depth residual error network ResNet50 as the input of a full-connection network to obtain a depth classification network model;
(2) generating a positive sample set:
(2a) inputting a first frame image in a video image sequence containing a target to be tracked, and determining a rectangular frame by taking the center of the initial position of the target to be tracked as the center and the length and the width of the target to be tracked as the length and the width;
(2b) 3000 rectangular target images with the same size are intercepted from the rectangular frame to form a positive sample set;
(2c) randomly selecting one rectangular target image from the positive sample set, uniformly cutting the rectangular target image into 3 multiplied by 3 small rectangular images by 3 equal parts, discarding the part which is less than 3 equal parts, randomly combining and splicing the 9 divided small rectangular images into 4000 recombined images with the same size as the rectangular target image to form a recombined image set;
(2d) in a first frame of video image, acquiring a scale change image set in a single-pixel stepping mode;
(2e) adding the recombined image set and the scale change image set into a positive sample set to form an expanded positive sample set;
(3) generating a negative sample set:
(3a) in a first frame image of a video, determining 5 large rectangular frames by taking an initial position of a target to be tracked as a center, wherein the length and the width of each large rectangular frame are respectively 1.5, 1.6, 1.7, 1.8 and 1.9 times of the length and the width of the target to be tracked;
(3b) selecting a rectangular sliding frame with the same size as the target to be tracked;
(3c) sliding the rectangular sliding frame in each large rectangular frame, intercepting images in the rectangular sliding frame after each translation, and forming an expanded negative sample set by all the intercepted images;
(4) training a deep classification network model:
(4a) inputting the extended positive sample set and the extended negative sample set into a deep classification network model;
(4b) updating the weight of each node in the deep classification network model by using a random gradient descent method to obtain a trained deep classification network model;
(5) extracting a target characteristic network model:
deleting an output layer of a full-connection network in the trained deep classification network model, taking a hidden layer of the full-connection network as an output layer of a target characteristic network, and extracting the target characteristic network model;
(6) predicting the target position of the current frame image:
(6a) loading a next frame image in a video sequence to be tracked as a current frame image, establishing a rectangular area by using the size which is 1.5 times of the length and width of a target to be tracked in the current frame image by taking the position of the target to be tracked of the loaded previous frame image as the center, and taking the rectangular area as a candidate area of the target to be tracked;
(6b) intercepting rectangular images in a candidate region of a target to be tracked in a sliding mode with step change, and forming a candidate image sequence by all the intercepted rectangular images;
(6c) inputting the candidate image sequence into a target characteristic network model, outputting a characteristic response value sequence corresponding to each candidate image, and selecting a maximum characteristic response value from the characteristic response value sequence;
(6d) in the current frame image, taking the position of the candidate image corresponding to the maximum characteristic response value as the position of the tracking target;
(7) judging whether the current frame video image is the last frame video image of the video image sequence to be tracked, if so, executing the step (8), otherwise, executing the step (6);
(8) and finishing the tracking of the moving target to be tracked.
2. The method for tracking the moving object based on the sample expansion and depth classification network of claim 1, wherein the specific steps of the single-pixel stepping mode in the step (2d) are as follows:
step one, forming a small rectangular frame by using the center of the initial position of the target to be tracked and the length and width which are 0.5 times of the initial position;
keeping the center of the small rectangular frame unchanged, increasing the width of the small rectangular frame by 1 pixel, and taking the product of the length-width ratio of the small rectangular frame and the increased width as the length to form a temporary rectangular frame;
thirdly, judging whether the width of the temporary rectangular frame is more than or equal to 3 times of the width of the small rectangular frame, if so, executing the fourth step, otherwise, intercepting and storing the rectangular image in the temporary rectangular frame and then executing the second step;
and fourthly, forming all the intercepted images into a scale change image set.
3. The method for tracking the moving object based on the sample expansion and depth classification network of claim 1, wherein the rectangular sliding frame slides in each large rectangular frame in the step (3c) in a manner that: and (3) sequentially translating the rectangular sliding frame to the upper right corner, the lower right corner and the lower left corner of each large rectangular frame by taking the upper left corner of each large rectangular frame as a starting point and 1 pixel as a step length, and finally translating the rectangular sliding frame back to the upper left corner.
4. The method for tracking the moving object based on the sample expansion and depth classification network of claim 1, wherein the random gradient descent method in the step (4b) comprises the following specific steps:
step one, randomly selecting a number in a range of (0,0.1), and using the number as an initial weight of each node in a deep classification network model;
secondly, taking the initial weight of each node as the current weight of each node in the deep classification network model in the first iteration process;
thirdly, randomly selecting 2^n sample images from the positive and negative sample sets, where n is greater than or equal to 3 and less than or equal to 7, and propagating them forward through the deep classification network model, the output layer of the deep classification network model outputting the classification results of the 2^n sample images;
fourthly, calculating the average logarithmic loss value of the classification result according to the following formula:
L = -(1/N) Σ_{i=1}^{N} [ y_i·log(p_i) + (1 - y_i)·log(1 - p_i) ]
wherein L represents the average logarithmic loss value of the classification results, N represents the total number of randomly selected sample images, Σ represents the summation operation, i represents the index of an input sample image, y_i represents the class of the ith input sample image, y_i taking the value 1 for a positive-class sample and 0 for a negative-class sample, log denotes the base-10 logarithm operation, and p_i represents the output value of the deep classification network model for the ith sample image in the classification results;
fifthly, solving a partial derivative of the current weight of each node in the depth classification network by using the average logarithmic loss value to obtain a gradient value of the current weight of each node in the depth classification network model;
sixthly, calculating the updated weight of each node in the deep classification network model according to the following formula:
ŵ_k = w_k - α·Δw_k
wherein ŵ_k represents the updated weight of the kth node of the deep classification network model, w_k represents the current weight of the kth node of the deep classification network model, α represents the learning rate, whose value range is (0, 1), and Δw_k represents the gradient value of the current weight of the kth node in the deep classification network model;
and seventhly, judging whether all the sample images in the training sample set are selected, if so, obtaining a trained deep classification network model, and otherwise, executing the third step after taking the updated weight of each node as the current weight.
5. The method for tracking the moving object based on the sample expansion and depth classification network of claim 1, wherein the step change sliding manner in step (6b) comprises the following steps:
the method comprises the steps that firstly, a rectangular sliding frame with the same size as a target to be tracked is selected, and the maximum sliding step length and the minimum sliding step length in the x-axis direction and the y-axis direction are respectively set;
secondly, placing a rectangular sliding frame at the upper left corner of the candidate area of the target to be tracked;
thirdly, calculating the sliding step length in the positive direction of the x axis according to the following formula:
(Equation image in the original: S_x expressed in terms of S_x1, S_x2, w, u′ and u.)
wherein S_x represents the sliding step in the positive x-axis direction, S_x1 represents the maximum sliding step in the x-axis direction, S_x2 represents the minimum sliding step in the x-axis direction, w represents the width of the target to be tracked, u′ represents the abscissa of the center point of the rectangular sliding frame, and u represents the abscissa of the center point of the candidate region of the target to be tracked;
fourthly, sliding a rectangular sliding frame in the sliding step length of the positive direction of the x axis, and capturing the framed image;
fifthly, judging whether the rectangular sliding frame exceeds the candidate area of the target to be tracked, if so, translating the rectangular sliding frame to the leftmost side of the candidate area of the target to be tracked along the negative direction of the x axis, and then executing the sixth step, otherwise, executing the third step;
and sixthly, calculating the sliding step length in the positive direction of the y axis according to the following formula:
(Equation image in the original: S_y expressed in terms of S_y1, S_y2, h, v′ and v.)
wherein S_y represents the sliding step in the positive y-axis direction, S_y1 represents the maximum sliding step in the y-axis direction, S_y2 represents the minimum sliding step in the y-axis direction, h represents the length of the target to be tracked, v′ represents the ordinate of the center point of the current position of the rectangular sliding frame, and v represents the ordinate of the center point of the candidate region of the target to be tracked;
seventhly, sliding the rectangular sliding frame in the sliding step length of the positive direction of the y axis, and capturing the framed image;
eighthly, judging whether the rectangular sliding frame exceeds the candidate area of the target to be tracked, if so, executing the ninth step, otherwise, executing the third step;
and step nine, forming a candidate image sequence by all the intercepted images.
CN201811005680.0A 2018-08-30 2018-08-30 Moving target tracking method based on sample expansion and depth classification network Active CN109345559B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811005680.0A CN109345559B (en) 2018-08-30 2018-08-30 Moving target tracking method based on sample expansion and depth classification network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811005680.0A CN109345559B (en) 2018-08-30 2018-08-30 Moving target tracking method based on sample expansion and depth classification network

Publications (2)

Publication Number Publication Date
CN109345559A CN109345559A (en) 2019-02-15
CN109345559B true CN109345559B (en) 2021-08-06

Family

ID=65292344

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811005680.0A Active CN109345559B (en) 2018-08-30 2018-08-30 Moving target tracking method based on sample expansion and depth classification network

Country Status (1)

Country Link
CN (1) CN109345559B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110033473B (en) * 2019-04-15 2021-04-20 西安电子科技大学 Moving target tracking method based on template matching and depth classification network
CN110135365B (en) * 2019-05-20 2021-04-06 厦门大学 Robust target tracking method based on illusion countermeasure network
CN110298838A (en) * 2019-07-09 2019-10-01 国信优易数据有限公司 A kind of method, apparatus, equipment and the storage medium of determining sample image
CN112164097B (en) * 2020-10-20 2024-03-29 南京莱斯网信技术研究院有限公司 Ship video detection sample collection method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107871124A (en) * 2017-11-15 2018-04-03 陕西师范大学 A kind of Remote Sensing Target detection method based on deep neural network
CN107945210A (en) * 2017-11-30 2018-04-20 天津大学 Target tracking algorism based on deep learning and environment self-adaption
CN108171112A (en) * 2017-12-01 2018-06-15 西安电子科技大学 Vehicle identification and tracking based on convolutional neural networks
CN108171752A (en) * 2017-12-28 2018-06-15 成都阿普奇科技股份有限公司 A kind of sea ship video detection and tracking based on deep learning
CN108090918A (en) * 2018-02-12 2018-05-29 天津天地伟业信息系统集成有限公司 A kind of Real-time Human Face Tracking based on the twin network of the full convolution of depth

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Convolutional neural networks based scale-adaptive kernelized correlation filter for robust visual object tracking; Bing Liu et al.; 2017 International Conference on Security, Pattern Analysis, and Cybernetics; 2017-12-31; pp. 423-428 *
基于特征在线选择的目标压缩跟踪算法 (Target compressive tracking algorithm based on online feature selection); 李庆武 et al.; 《自动化学报》 (Acta Automatica Sinica); 2015-11-30; vol. 41, no. 11; pp. 1961-1970 *
基于碎片表征的尺度自适应运动目标跟踪 (Scale-adaptive moving target tracking based on fragment representation); 朱雨莲 et al.; 《计算机工程》 (Computer Engineering); 2016-09-30; vol. 42, no. 9; pp. 268-273 *

Also Published As

Publication number Publication date
CN109345559A (en) 2019-02-15

Similar Documents

Publication Publication Date Title
CN110033473B (en) Moving target tracking method based on template matching and depth classification network
CN111259930B (en) General target detection method of self-adaptive attention guidance mechanism
CN109345559B (en) Moving target tracking method based on sample expansion and depth classification network
CN110930347B (en) Convolutional neural network training method, and method and device for detecting welding spot defects
CN109886066B (en) Rapid target detection method based on multi-scale and multi-layer feature fusion
CN109754017B (en) Hyperspectral image classification method based on separable three-dimensional residual error network and transfer learning
CN110781924B (en) Side-scan sonar image feature extraction method based on full convolution neural network
CN108416266B (en) Method for rapidly identifying video behaviors by extracting moving object through optical flow
CN107563433B (en) Infrared small target detection method based on convolutional neural network
CN108038435B (en) Feature extraction and target tracking method based on convolutional neural network
CN107633226B (en) Human body motion tracking feature processing method
CN111640125A (en) Mask R-CNN-based aerial photograph building detection and segmentation method and device
CN110334589A (en) A kind of action identification method of the high timing 3D neural network based on empty convolution
CN112348849A (en) Twin network video target tracking method and device
CN109753996B (en) Hyperspectral image classification method based on three-dimensional lightweight depth network
CN109377511B (en) Moving target tracking method based on sample combination and depth detection network
CN113177456B (en) Remote sensing target detection method based on single-stage full convolution network and multi-feature fusion
CN111242109B (en) Method and device for manually fetching words
US11367206B2 (en) Edge-guided ranking loss for monocular depth prediction
CN110516700B (en) Fine-grained image classification method based on metric learning
CN117079098A (en) Space small target detection method based on position coding
CN113963333B (en) Traffic sign board detection method based on improved YOLOF model
CN113989631A (en) Infrared image target detection network compression method based on convolutional neural network
CN110866552B (en) Hyperspectral image classification method based on full convolution space propagation network
CN114494441B (en) Grape and picking point synchronous identification and positioning method and device based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant