CN109671102B - Comprehensive target tracking method based on depth feature fusion convolutional neural network - Google Patents

Comprehensive target tracking method based on depth feature fusion convolutional neural network

Info

Publication number
CN109671102B
CN109671102B (application CN201811467752.3A)
Authority
CN
China
Prior art keywords
target
network
sub
sequence
network module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201811467752.3A
Other languages
Chinese (zh)
Other versions
CN109671102A (en)
Inventor
王天江
冯平
赵志强
罗逸豪
冯琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201811467752.3A
Publication of CN109671102A
Application granted
Publication of CN109671102B


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/251Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a comprehensive target tracking method based on a channel feature fusion convolutional neural network, belonging to the field of computer vision. On one hand, a new channel feature fusion convolution layer is added to the network structure, so that a convolutional neural network suitable for target tracking can extract depth features as the appearance representation of the target. On the other hand, a long-term classification prediction sub-network module and a regression prediction sub-network module are constructed before tracking and trained with samples collected from the information of the initial target; during tracking, all candidate blocks are classified by the long-term classification prediction sub-network module, and tracking is carried out by adaptively combining the long-term and short-term classification prediction sub-network modules, the regression prediction sub-network module and the multi-template matching module according to the probability results of the candidate blocks belonging to the foreground class. The method has strong robustness and high accuracy.

Description

Comprehensive target tracking method based on depth feature fusion convolutional neural network
Technical Field
The invention relates to the field of computer vision, in particular to a comprehensive target tracking method based on a depth feature fusion convolutional neural network. The method can effectively improve the success rate and accuracy of tracking the video target in a complex scene.
Background
In modern society, informatization is developing ever faster, and a large number of video acquisition devices in people's work and life record and store a large amount of video data. On the one hand, it becomes increasingly difficult, and even impossible, to analyze and process such data manually. On the other hand, many practical applications need to exploit these video data, mainly including video security monitoring, intelligent traffic management, intelligent human-computer interaction systems, target motion analysis, and automatic driving of motor vehicles. Video target tracking plays an extremely important role in each specific application of video analysis, video understanding and video interaction, and is a basic technology on which such high-order video tasks rely.
The video target tracking problem is a very active research topic in the field of computer vision, but it is also very challenging because of a series of interference factors that may exist in a scene, such as illumination change, scale change, posture change and target occlusion.
Video target tracking means that, after video data are acquired by a video acquisition device, one or more objects are selected from the video as tracking targets and the initial center position and scale information of the target area are given; an effective target tracking method then predicts the center position and scale information of the targets in subsequent video frames, so that continuous tracking of the targets is accomplished.
A great number of applications in people's work and life rely on video-based target tracking as a basic support. Completing target tracking automatically with computer vision technology can liberate people from a large amount of tedious and inefficient work and provide an important basis for analysis and decision making. However, various interference factors often occur in complex real-world scenes, which makes video-based target tracking very difficult.
Therefore, a novel method or system for target tracking based on video needs to be developed to realize target tracking with strong robustness and high accuracy.
Disclosure of Invention
Aiming at the defects or the improvement requirements of the prior art, the invention provides a comprehensive target tracking method based on a depth feature fusion convolutional neural network, which aims to realize target tracking with strong robustness and high accuracy by extracting target depth features, constructing a plurality of different processing modes and fully combining the advantages of a generative model, a discriminant model, long-term tracking and short-term tracking, thereby further providing a good basis for video analysis, video understanding and video interaction and further providing a good technical support for video safety monitoring, intelligent traffic control, target motion analysis, a human-computer interaction system and visual application represented by automatic driving.
In order to achieve the above object, the present invention provides a comprehensive target tracking method based on a depth feature fusion convolutional neural network, also called a single vision target tracking method of a complex scene, which includes the following steps:
(1) modifying the VGG-M network model, adding a convolution layer with channel feature fusion, taking a convolution part in the network as a shared depth feature extraction sub-network, taking the rest part of the network as a sequence specific depth feature classification sub-network, and connecting the two sub-networks to construct a convolutional neural network model with channel feature fusion;
In the method, the modified network reduces the number of convolutional layers and fully-connected layers of the VGG-M network, and only one fully-connected layer is kept in the final depth feature classification sub-network. A channel feature fusion convolution layer is added before the fully-connected layer, so that essentially the same amount of feature information is carried in a lower data dimension, which helps to speed up the similarity calculation module of the generative model. Without the channel feature fusion convolution layer, the last convolution layer outputs 512 feature channels; after the fusion layer is added, 32 fused feature channels are obtained.
(2) Collecting video sequences carrying target position and scale information, and respectively collecting foreground samples and background samples of each video sequence according to target information provided by labels to form a training set of a network model;
Scholars and research institutes have published video target tracking datasets containing different challenge factors, including VOT-2013, VOT-2014 and VOT-2015; the video sequences are selected from these datasets with duplicate videos removed. For each selected video sequence, a portion of the video frame images is chosen at random; then, for each chosen frame, a large number of sample image sub-blocks are generated by sampling with a Gaussian function of the target center-point coordinates and of the height and width scale parameters, according to the annotated target position and scale information. The images of these sub-block regions are cropped and normalized, the sub-blocks are divided into a foreground class and a background class according to the overlap ratio between each sub-block region and the real target region, and the two classes of samples are kept in a certain proportion to form the training sample set of the network model.
(3) Forming batches of training samples according to a mode corresponding to the sequences, and performing loop iterative training on the network model one by one until set loop times are completed or a preset precision threshold value is reached;
Limited by the processing speed of the deep neural network, the samples are organized in batches during training of the network model. The network is trained iteratively in a sequence-cyclic manner: in each cycle, the shared feature extraction sub-network and each sequence-specific feature classification sub-network are updated one by one with the batch of samples belonging to the corresponding sequence. A certain number of cycles is set first so that the convergence of the classification performance can be observed; if the convergence requirement is not met, the threshold on the number of cycles is increased, and otherwise the number of iterations should be properly reduced to avoid overfitting of the deep network.
(4) For a new video sequence, reconstructing a sequence-specific feature classification sub-network module and a sequence-specific regression prediction sub-network module corresponding to the new video sequence, and connecting the sequence-specific feature classification sub-network module and the sequence-specific regression prediction sub-network module with the shared depth feature extraction sub-network to form a new sequence target tracking network model;
Specifically, the video sequences in the training sample set contain various interference factors such as illumination change, attitude change, target rotation, scale change, motion blur and target occlusion. Therefore, after the network model has been sufficiently trained on these samples, robust depth fusion features can be extracted by the shared feature extraction sub-network.
The tracked objects differ from one video sequence to another, and the object tracked in one sequence may appear as background, or even as a similar distractor, in another video. Therefore, for target tracking on a new video sequence, a completely new sequence-specific depth feature classification sub-network needs to be constructed and connected with the trained shared feature extraction sub-network to form the classification prediction network model used in the tracking process. In addition, the method of the invention also constructs a sequence-specific depth feature regression prediction sub-network module for the new video sequence.
(5) Acquiring initial foreground samples and background samples by using the position and scale information of a target in a new sequence first frame, training a newly constructed feature classification sub-network by using the samples, training a regression prediction sub-network module by using a positive sample in the regression prediction sub-network module, extracting depth features of the initial target by using a shared depth feature extraction sub-network, and taking the extracted features as an initial feature template of the target;
The feature classification prediction sub-network module and the feature regression prediction sub-network module needed for tracking the new video sequence were constructed in the previous step. Samples of the foreground class and the background class are collected in the initial frame according to the information of the initial target of the new sequence; the classification sub-network is trained with all the samples to obtain the long-term classification prediction sub-network module, and the regression prediction sub-network module is trained with the foreground-class samples. The output of the final convolution layer for the initial target region is processed as the feature and stored as the initial target feature template.
(6) A plurality of different target characteristic templates are used in the target tracking process, wherein the initial set of historical target characteristic templates is set to be empty, and the target characteristic template of the previous frame is set to be the initial target characteristic template;
The method is a comprehensive target tracking method that uses a generative module with a multi-template matching strategy. The initial frame and the frame immediately preceding the current frame respectively contain the initial and the most recent tracking information of the target; in addition, during earlier tracking, the target may have shown appearance characteristics that can reappear in subsequent tracking. From this information, an initial target feature template, a previous-frame target feature template and historical target feature templates are constructed respectively; before target tracking starts, the historical feature template set is set to empty and the previous-frame target feature template is set to the initial target feature template.
(7) Generating candidate regions of the target by using the latest target position and scale information, extracting the depth features of the regions by using a shared feature extraction sub-network, and respectively calculating the classification probability that the regions belong to a foreground class and a background class;
In general, the motion of the target has a certain regularity, and the change of the target's position and scale in a new frame relative to the previous frame can be modeled with a Gaussian distribution. Candidate target regions are therefore generated with a Gaussian function, the shared feature extraction sub-network of the network model extracts features from the generated candidate regions, and the classification prediction sub-network then calculates the probabilities of each candidate belonging to the foreground class and the background class.
(8) Judging the change degree of the target appearance according to the probability result that the depth features of all candidate blocks belong to the foreground class, comparing the probability values with a set threshold value, and taking the comparison result as a condition, namely whether the probability values of all candidate blocks belonging to the foreground class are all larger than the set threshold value;
(9) when the probability values of all candidate blocks belonging to the foreground class are greater than the set threshold value, the judgment condition of the last step is satisfied, the appearance change degree of the target is small, the probability of correct recognition by the long-term classification prediction sub-network module is high, and the long-term classification prediction sub-network module and the regression prediction network module are combined to analyze and calculate the comprehensive prediction value;
otherwise, when the judgment condition is not satisfied, the appearance of the target is possibly changed greatly, at the moment, a short-term classification prediction sub-network module is newly constructed, and the long-term and short-term classification prediction sub-network module is combined with the multi-template matching module to analyze and calculate a comprehensive prediction value;
since the long-term classification prediction sub-network module is updated only with samples collected during tracking with relatively high confidence, when the appearance of the target changes to a large extent, this module may tend to classify all candidate blocks into the background class, and the probability that all candidate blocks belong to the foreground class is less than the set threshold. At the moment, severe tracking drift is easy to occur only by using the long-term classification prediction sub-network module, so a new short-term classification prediction sub-network module is constructed, and the comprehensive prediction value is calculated by combining the long-term classification prediction sub-network module and the multi-template matching generation module. On the contrary, if the probability that part of candidate blocks belong to the foreground class is larger than the set threshold, the target appearance is not changed greatly, and the comprehensive predicted value is calculated by only combining the long-term classification prediction sub-network module and the regression prediction sub-network module.
(10) Taking a candidate block with the highest predicted value as a target tracking result of a current frame, updating a target feature template of the previous frame into a new target block feature, acquiring samples according to a new target position and scale information, adding the samples into a sample set for updating a short-term classification prediction sub-network module, and analyzing the probability of all candidate blocks belonging to a foreground class, thereby determining whether to add the candidate blocks into the sample set of the long-term classification prediction sub-network module, and whether to generate a new historical target feature template and an updating network;
after the comprehensive predicted values of all candidate blocks are obtained through combined calculation of different modules, the block with the largest predicted value is selected as the target of the current frame, then the target feature of the previous frame used in the multi-template strategy is replaced by the depth feature of the new target block, samples are collected according to the new target position and scale information, and the samples are added to update the sample set of the short-term classification prediction sub-network module.
The features of all candidate blocks are classified by the long-term classification prediction sub-network module, and the result objectively reflects the degree of change of the target appearance. If the probability that the candidate blocks belong to the foreground class is not high, the current tracking result is considered to have low reliability, which indicates that the appearance characteristics of the target have changed obviously; in this case the long-term classification prediction sub-network module is updated with the high-reliability samples collected in the sample set during tracking, and the depth features of the new target block are added to the historical target feature template set;
and otherwise, adding the collected samples into the sample set of the long-term classification prediction sub-network module.
(11) And (4) judging whether the tracking is finished or not, if not, jumping to the step (7), and sequentially and circularly executing the step (7) to the step (11).
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
the method disclosed by the invention comprises the following steps: constructing a two-classification depth feature fusion convolutional neural network model, wherein the network model comprises a shared feature extraction sub-network and sequence-specific feature classification sub-networks which are in one-to-one correspondence with the tracking sequences; selecting a video sequence from the marked video tracking public data set to construct a training set, collecting foreground samples and background samples from the training set, and performing sequence round-robin iterative training on the network model by using the collected sequence samples. When a target in a new video sequence is tracked, various parameters in the feature extraction sub-network are kept fixed, and a sequence-specific feature classification sub-network and a sequence-specific regression prediction sub-network module are reconstructed for the new sequence; acquiring initial sequence-related foreground-background classification samples according to the first frame target position and scale information of the new sequence, and training a newly-constructed sequence-specific feature classification sub-network and a regression prediction sub-network module by using the samples; in the process of target tracking, candidate blocks are generated according to the latest target position and scale information, the latest network is used for extracting features and classifying the candidate blocks, when the probability that all the candidate blocks belong to the foreground class is greater than a set classification threshold value, the long-term classification prediction sub-network module and the regression prediction sub-network module are combined for prediction, and a sample is collected according to new target state information and stored in a sample set of the long-term classification prediction sub-network module; otherwise, constructing and training a short-term sequence specific classification sub-network module, combining the long-term and short-term classification prediction sub-network module with the multi-template matching module for prediction, updating the long-term classification prediction sub-network module by using samples collected in the tracking process, and adding the depth features of the new target area into the historical target feature template set; collecting samples according to the new target state information and storing the samples into a sample set of a short-term classification network module; and taking the candidate block with the highest predicted value as a new target tracking result. The invention extracts features through a depth feature fusion convolutional neural network and provides an integrated target tracking method based on the fusion features, the design is simple, and the prediction precision of target tracking can be effectively improved through a long-term and short-term classification prediction sub-network module and an integrated target tracking model combined by a multi-template matching module.
Drawings
Fig. 1 is a schematic frame diagram of a principle of an integrated target tracking method based on a channel feature fusion convolutional neural network in an embodiment of the present invention.
Fig. 2 is a schematic network structure diagram of a channel feature fusion convolutional neural network in the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention mainly aims to provide a comprehensive tracking method based on a channel feature fusion convolutional neural network for the tracking problem of a single visual target in a complex scene, which constructs a plurality of different processing modules by extracting depth features with good robustness on target rotation, illumination change, posture change, target shielding and the like, fully combines the advantages of a generating model, a discriminating model, long-term tracking and short-term tracking, realizes target tracking with strong robustness and high accuracy, further provides a good foundation for video analysis, video understanding and video interaction, and further provides good technical support for video safety monitoring, intelligent traffic control, target motion analysis, a man-machine interaction system and visual application represented by automatic driving.
The invention provides a comprehensive target tracking method based on a channel feature fusion convolutional neural network. On one hand, a new channel feature weighting convolution layer is added to the network structure, and a convolutional neural network suitable for target tracking is constructed to extract depth features as the appearance representation, so that features that are originally sparse but spread over a large number of channels carry essentially the same amount of information in a lower feature dimension, which helps to accelerate the similarity computation. On the other hand, a long-term classification prediction sub-network module and a regression prediction sub-network module are constructed before tracking; samples are collected with the information of the initial target and used to train these two modules. During tracking, all candidate blocks are classified by the long-term classification prediction sub-network module, and tracking is carried out by adaptively combining the long-term classification prediction sub-network module, the regression prediction sub-network module and the multi-template matching module according to the probability results of the candidate blocks belonging to the foreground class. Likewise, the reconstruction of the short-term classification prediction sub-network module, the updating of the regression prediction sub-network module, the collection of samples during tracking and the generation of historical target feature templates are all performed adaptively according to the probability results of all candidate blocks belonging to the foreground class.
Fig. 1 is a schematic frame diagram of an integrated target tracking method based on a channel feature fusion convolutional neural network in an embodiment of the present invention, where the method mainly includes the following steps:
(1) modifying the number of layers of a classic classification network VGG-M and the size of a convolution kernel of each convolution layer, adding a new depth feature fusion convolution layer, taking a convolution part before a full connection layer as a feature extraction sub-network shared by all sequences, constructing a sequence-specific feature classification sub-network comprising the full connection layer and a function layer for each tracked sequence, and connecting the two sub-networks together to form a depth feature fusion convolution neural network model;
During tracking, the blocks analyzed in each frame are relatively small, so the VGG-M network is modified so that the normalized input image received by the network has size 107 × 107 × 3 and only 3 convolution layers are kept, with kernel sizes of 7 × 7 × 3 × 96, 5 × 5 × 96 × 256 and 3 × 3 × 256 × 512 respectively, where the first two parameters give the spatial size of the kernel and the last two give the numbers of feature channels before and after the convolution. The convolution strides of the first two convolution layers are 2 × 2 and the stride of the third convolution layer is 1 × 1. Between the convolution layers there are a ReLU layer, a normalization layer and a pooling layer; the pooling scale is 3 × 3 and the pooling stride is 2 × 2. After the ReLU layer that follows the third convolution layer, a channel feature fusion convolution layer and another ReLU layer are added; the kernel size of this feature fusion convolution layer is 1 × 1 × 512 × 32 and its convolution stride is 1 × 1. Together these layers form the sequence-shared feature extraction sub-network, which is followed by a fully-connected layer with kernel size 3 × 3 × 512 × 2 and a function layer that together form the sequence-specific feature classification sub-network. The method of the present application also uses a sequence-specific regression prediction sub-network module that is structurally similar to the feature classification sub-network, except that its kernel size is 3 × 3 × 512 × 1 and its function layer uses the logistic function instead of the softmax function.
(2) In order to train and obtain a network model aiming at a tracking problem, tracking videos with target position and scale information are collected, and foreground samples and background samples are collected for each video sequence by using state information of a target, so that a training set of the network model is formed;
and selecting the video sequence with different challenging factor scenes and containing the label for sampling the network model training sample. Randomly selecting 8 frames of images from each video sequence, defining foreground class and background class samples on the images according to the position and scale information given by the mark of the target, and respectively collecting 50 samples and 200 samples of the two classes. The foreground-type and background-type samples are defined according to the size of the overlapping ratio of the areas of the sample region and the region labeled by the real target, and two thresholds are set, wherein one threshold is 0.7, and the other threshold is 0.5. If the ratio of the area overlap is greater than or equal to 0.7, the corresponding sample is defined as a foreground class sample, whereas if the ratio of the area overlap is less than 0.5, the corresponding sample is defined as a background class sample.
(3) Forming batches of the collected training samples according to different sequences, and performing sequence type cyclic iterative training on the network model by using the batches until a set cycle number is reached or the error rate of the network is lower than a preset threshold value;
the initial network model is iteratively trained 150 times using a loop of video sequences in a training set, this process being primarily to learn the convolution kernel parameters of a shared feature extraction sub-network. For each time of the training of the loop iteration, 32 foreground samples and 96 background samples are randomly selected from all samples of each sequence in the training set to form a sample batch used by one iteration of the sequence.
(4) For a new video sequence, reconstructing a sequence-specific feature classification sub-network and a sequence-specific regression prediction sub-network module corresponding to the new video sequence, and connecting the sequence-specific feature classification sub-network and the sequence-specific regression prediction sub-network module with a shared depth feature extraction sub-network so as to form a new network model used in sequence target tracking;
(5) Acquiring initial foreground samples and background samples by using the position and scale information of a target in a new sequence first frame, training a newly constructed feature classification sub-network by using the samples, training a regression prediction sub-network module by using positive samples, extracting depth features of the initial target by using a shared depth feature extraction sub-network, and taking the extracted features as an initial feature template of the target;
500 foreground samples and 5000 background samples are collected using the position and scale information of the target in the initial frame of the sequence; as before, 32 foreground samples and 96 background samples are randomly selected from them each time as the batch fed to the network, and 20 cycles of iterative training are performed, which trains the parameters of the newly constructed sequence-specific fully-connected layer. The depth features of the initial target block extracted by the shared feature extraction sub-network are then vectorized and normalized, and the result is taken as the initial target feature template.
(6) In the target tracking process, a plurality of different target characteristic templates are used, wherein an initial historical target characteristic template set is empty, and a target characteristic template of a previous frame is set as an initial target characteristic template.
(7) Generating candidate regions of the target by using the latest target position and scale information, extracting the depth characteristics of the regions by using a network model and calculating the classification probability of the regions belonging to the foreground class and the background class;
and generating 256 candidate sample blocks by using the coordinates of the central point and the Gaussian function of the length-width scale according to the latest position and scale information of the target, extracting the depth features of the 256 candidate sample blocks, and classifying the features of the candidate blocks by using the latest long-term classification prediction sub-network module.
(8) Judging the change degree of the target appearance according to the result that the depth features of all candidate blocks belong to the foreground class probability, comparing the probability values with a set threshold, and taking the compared result as a judgment condition, namely whether the probability values of all candidate blocks belonging to the foreground class are all larger than the set threshold;
the threshold used for comparison in the determination condition is set to 0.55, and a value higher than this threshold indicates that the probability of belonging to the foreground class is higher than the probability of belonging to the background class, and a result lower than this threshold is known by reverse extrapolation.
(9) When the judgment condition of the previous step is satisfied, namely every candidate block is more likely to belong to the foreground class than to the background class, the appearance change of the target is small and the probability of correct recognition by the long-term classification prediction sub-network module is high; the long-term classification prediction sub-network module and the regression prediction sub-network module are then combined to analyze and calculate the comprehensive prediction value;
When the judgment condition is not satisfied, namely some candidate blocks are more likely to belong to the background class than to the foreground class, the appearance of the target has changed greatly; in this case a short-term classification prediction sub-network module is newly constructed, and the long-term and short-term classification prediction sub-network modules are combined with the multi-template matching module to analyze and calculate the comprehensive prediction value;
When the judgment condition of the previous step is satisfied, the blocks whose probability of belonging to the foreground class exceeds 0.5 are selected from the candidate blocks, and the long-term classification prediction sub-network module is combined with the regression prediction sub-network module: the former contributes the probability of belonging to the foreground class with its weight fixed to 1, while the output of the regression prediction sub-network module directly represents the probability that a block is the target, and its weight is set to the average of the 5 highest foreground-class probability values among the selected blocks (if fewer than 5 blocks satisfy the condition, the average of all of them is taken). Otherwise, when the condition is not satisfied, the first 50 candidate blocks, sorted by foreground-class probability from high to low, are selected; a new short-term classification prediction sub-network module is constructed, trained with samples from the latest three frames, and used to calculate the probability of the selected blocks belonging to the foreground class. The probabilities calculated by the long-term and short-term classification prediction network modules are combined by weighting, where the weight of the long-term module is set to 1 and the weight of the short-term module is determined by the proportion of the recent high-reliability foreground samples collected during tracking that the short-term module classifies correctly. Then the similarity between the depth features of each selected block and the three target feature templates used in the method is calculated with the EMD distance, and the similarities are weighted to obtain a comprehensive matching value; the weights of the three similarities are set according to the following formulas:
ω_f = C_1    (1)
ω_l: equation (2) (formula rendered as an image in the original publication)
ω_h: equation (3) (formula rendered as an image in the original publication)
where ω_f, ω_l and ω_h respectively denote the weights in the weighted summation of the three similarities associated with the initial target feature template, the previous-frame target feature template and the historical target feature template set; p*(t−1) denotes the probability of belonging to the foreground class calculated by the long-term classification prediction sub-network module for the tracking result of the previous frame; and the parameters C_1, C_2, α and β are four different constants, set to 2, 0.2, 0.5 and 0.01 respectively. Finally, the probability prediction value and the matching value are weighted and combined to calculate the comprehensive prediction value, with their weights set to 0.7 and 0.3 respectively.
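The adaptive scoring of this step can be summarized in the sketch below. Its inputs are the per-candidate outputs of the individual modules; it reproduces only the weighting scheme stated in the text (long-term weight fixed to 1, regression weight from the top-5 foreground probabilities, short-term weight `w_short` from its recent classification accuracy, template weights ω_f, ω_l, ω_h from equations (1)-(3), and the final 0.7 / 0.3 combination). The candidate pre-selection steps are omitted and all function and variable names are simplifying assumptions.

```python
import numpy as np

def comprehensive_scores(p_long, p_reg, p_short, sims, omega, w_short,
                         threshold=0.55, w_prob=0.7, w_match=0.3):
    """p_long, p_reg, p_short: per-candidate outputs of the long-term classifier,
    the regression module and the short-term classifier; sims = (sim_initial,
    sim_previous, sim_history) are EMD-based similarities to the three templates;
    omega = (w_f, w_l, w_h) are the template weights of equations (1)-(3)."""
    if np.all(p_long > threshold):
        # small appearance change: long-term classifier (weight 1) + regression module
        w_reg = np.mean(np.sort(p_long[p_long > 0.5])[-5:])
        return p_long + w_reg * p_reg
    # large appearance change: long- and short-term classifiers + multi-template matching
    prob = p_long + w_short * p_short
    match = omega[0] * sims[0] + omega[1] * sims[1] + omega[2] * sims[2]
    return w_prob * prob + w_match * match
```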
(10) Taking a candidate block with the highest predicted value as a target tracking result of a current frame, updating a target feature template of the previous frame into a new target block feature, acquiring samples according to a new target position and scale information, adding the samples into a sample set for updating a short-term classification prediction sub-network module, and analyzing the probability of all candidate blocks belonging to a foreground class, thereby determining whether to add the samples into the sample set of the long-term classification prediction sub-network module and whether to facilitate the updating of a historical target feature template and a network model;
The block with the highest comprehensive prediction value is taken as the target tracking result of the current frame, the previous-frame target feature template is replaced with the depth features of this block, and foreground and background samples are collected using the position and scale information of the latest target block and stored in the sample set used for training after the short-term classification prediction sub-network module is constructed. For the current frame, if the classification result of the long-term classification prediction sub-network module contains candidate blocks whose foreground-class probability is greater than or equal to 0.6, the collected samples are also added to the sample set used for the subsequent updating of the long-term classification prediction sub-network module. Otherwise, the appearance of the new target block is considered to have changed obviously; in this case the long-term classification prediction sub-network module is updated with the foreground samples collected in the latest 20 frames (or all available frames if there are fewer than 20) and the background samples collected in the latest 50 frames (or all available frames if there are fewer than 50) from its sample set, and the processed depth features of the latest target block are stored in the historical target feature template set as a new historical target feature template.
(11) And (4) judging whether the tracking is finished or not, and if not, circularly executing the steps (7) to (11).
Fig. 2 is a schematic diagram of the network structure of the channel feature fusion convolutional neural network in an embodiment of the present invention. In the figure, Conv denotes a convolution layer and the number that follows indicates which convolution layer it is; in the parentheses below, K: n × n denotes the size of the kernel and s is the stride of the operation. Pooling denotes a pooling layer, ReLU denotes a rectified linear unit layer, and normalize denotes a normalization layer; feature fusion and full connection denote the feature fusion layer and the fully-connected layer respectively, both of which are special forms of convolution layers.
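To make the structure concrete, here is a minimal PyTorch-style sketch of the network drawn in Fig. 2 and described in step (1). The layer dimensions follow the kernel sizes given above; the normalization and pooling hyper-parameters, the decision to return both the 512-channel conv3 feature (fed to the classification and regression heads) and the 32-channel fused feature (used as the compact template for similarity matching), and all class and variable names are illustrative assumptions rather than the exact implementation of the invention.

```python
import torch
import torch.nn as nn

class SharedFeatureExtractor(nn.Module):
    """Sequence-shared sub-network: three convolution layers plus a 1x1 channel-fusion layer."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=7, stride=2), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(5), nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(5), nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 512, kernel_size=3, stride=1), nn.ReLU(inplace=True),
        )
        # channel feature fusion: 1x1x512x32 convolution, stride 1
        self.fusion = nn.Sequential(
            nn.Conv2d(512, 32, kernel_size=1, stride=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):                     # x: (N, 3, 107, 107)
        feat = self.conv(x)                   # (N, 512, 3, 3)
        return feat, self.fusion(feat)        # classification feature, fused template feature


class SequenceClassifier(nn.Module):
    """Sequence-specific classification head: 3x3x512x2 'fully connected' conv + softmax."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Conv2d(512, 2, kernel_size=3)

    def forward(self, feat):
        return torch.softmax(self.fc(feat).flatten(1), dim=1)   # [P(foreground), P(background)]


class SequenceRegressor(nn.Module):
    """Sequence-specific regression head: 3x3x512x1 conv + logistic output."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Conv2d(512, 1, kernel_size=3)

    def forward(self, feat):
        return torch.sigmoid(self.fc(feat).flatten(1))
```

With a 107 × 107 input, the convolution stack yields a 3 × 3 × 512 feature map, which matches the 3 × 3 × 512 × 2 fully-connected kernel size described above.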
In the following, a test on a number of video sequences is taken as an example: the two evaluation indexes of target tracking are introduced, and the target tracking results obtained with the tracking method proposed in the present invention are shown.
Two indexes are mainly used to evaluate target tracking accuracy. One measures, with the Euclidean distance, the error between the center position of the tracking result and the center position of the target's real state, and is called the Center Location Error (CLE); obviously, the smaller this distance, the smaller the error and the more accurate the tracking. The other measures the overlap ratio between the region of the tracking result and the region of the target's real state, and is called the Overlap Ratio (OR); the higher the overlap ratio, the more accurate the tracking result. The accuracy of tracking over a whole video sequence is evaluated by averaging the single-frame results. Suppose the center coordinates and region of the prediction obtained by a tracking method in a certain frame are denoted (x_p, y_p) and R_p, and the center coordinates and region of the corresponding real target are denoted (x_g, y_g) and R_g; then the two evaluation indexes are calculated as follows:

CLE = sqrt((x_p − x_g)² + (y_p − y_g)²)

OR = area(R_p ∩ R_g) / area(R_p ∪ R_g)

(The per-sequence CLE and OR comparison table appears as an image in the original publication.)
the first row in the table is the name of the different video sequences, the smaller the value of the CLE index the more accurate the center position, and the larger the value of the OR index the higher the degree of coincidence. As can be seen from the above table, the method of the present invention can obtain the tracking effect with small center position deviation and high coincidence degree in the above video sequence.
The invention fully utilizes advanced image processing and pattern recognition technology provided in the field of computer vision, and effectively completes video target tracking in complex scenes.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (6)

1. A comprehensive target tracking method based on a depth feature fusion convolutional neural network is characterized by comprising the following steps:
(1) modifying the VGG-M network model, adding a convolution layer with channel feature fusion, taking the convolution part as a shared depth feature extraction sub-network, taking the rest part as a sequence specific depth feature classification sub-network, and connecting the two to construct a convolutional neural network model with channel feature fusion;
(2) collecting video sequences carrying target position and scale information, and respectively collecting foreground samples and background samples of each video sequence according to target information provided by labels to form a training set of a convolutional neural network model;
(3) forming the training samples into batches in a sequence corresponding mode, and performing cyclic iterative training on the convolutional neural network model one by one until the set cycle number is completed or a preset precision threshold value is reached;
(4) for a new video sequence, reconstructing a sequence-specific depth feature classification sub-network module and a sequence-specific regression prediction sub-network module corresponding to the new video sequence, and connecting the two network modules with a shared depth feature extraction sub-network to form a new sequence target tracking network model;
(5) acquiring initial foreground samples and background samples by using the position and scale information of a target in a new video sequence first frame, training a newly-constructed sequence-specific depth feature classification sub-network module by using the two samples, training a sequence-specific regression prediction sub-network module by using a positive sample, extracting depth features of the initial target by using a shared depth feature extraction sub-network, and taking the extracted features as an initial feature template of the target,
outputting the final convolution layer of the target initial region as a feature process, and storing the processed result as an initial target feature template;
(6) a plurality of different target characteristic templates are used in the target tracking process, wherein the initial set of historical target characteristic templates is set to be empty, and the target characteristic template of the previous frame is set to be the initial target characteristic template;
(7) generating a candidate region of the target by using the latest target position and scale information, extracting depth features of the candidate region of the target by using a shared depth feature extraction sub-network, and respectively calculating the classification probability of the depth features belonging to a foreground class and a background class;
(8) judging the change degree of the target appearance according to the probability results that the depth features of all the candidate regions belong to the foreground class, comparing the probability values with a set threshold value, and taking the comparison result as a condition, namely whether the probability values of all the candidate regions belonging to the foreground class are all larger than the set threshold value;
(9) when the probability values of all candidate regions belonging to the foreground class are larger than a set threshold value, the appearance change degree of the target is small, the probability of correct recognition by the depth feature classification sub-network module specific to the long-term sequence is high, and at the moment, the depth feature classification sub-network module specific to the long-term sequence and the regression prediction sub-network module specific to the sequence are combined to analyze and calculate a comprehensive predicted value;
otherwise, the appearance of the target is greatly changed, at the moment, a short-term sequence specific depth feature classification sub-network module is newly constructed, and the long-term sequence specific depth feature classification sub-network module and the short-term sequence specific depth feature classification sub-network module are combined with the multi-template matching module to analyze and calculate a comprehensive predicted value;
(10) taking a candidate block with the highest predicted value as a target tracking result of a current frame, updating a target feature template of the previous frame into a new target block feature, acquiring a sample according to a new target position and scale information, adding the sample into a sample set of a depth feature classification sub-network module for updating short-term sequence specificity, and analyzing the probability of all candidate areas belonging to a foreground class, thereby determining whether to add the candidate areas into the sample set of the depth feature classification sub-network module for long-term sequence specificity, and whether to generate a new historical target feature template and an updating network;
after the comprehensive predicted values of all candidate areas are obtained through combination calculation, the area with the largest predicted value is selected as the target of the current frame, then the target feature of the previous frame used in the multi-template strategy is replaced by the depth feature of the new target area, samples are collected according to the new target position and scale information, and the samples are added into a sample set used for updating the short-term sequence specific depth feature classification sub-network module;
the long-term sequence specific depth feature classification sub-network module is used for classifying the features of all candidate regions, the obtained result reflects the degree of the appearance change of the target more objectively,
if the probability that all candidate areas belong to the foreground class is not high, judging that the reliability of the current tracking result is not high, wherein the situation shows that the appearance characteristics of the target are changed obviously, updating the long-term classification prediction sub-network module by using samples in a sample set with high reliability collected in the tracking process, and adding the depth characteristics of a new target area into a historical target characteristic template set;
otherwise, adding the collected samples into the sample set of the long-term sequence specific depth feature classification sub-network module;
(11) it is determined whether the tracking is finished or not,
and if not, jumping to the step (7), and sequentially and circularly executing the step (7) to the step (11).
2. The method for comprehensive target tracking based on the depth feature fusion convolutional neural network of claim 1, wherein in step 3,
under the influence of the processing speed of the deep neural network, a sample batch mode is adopted for organization in the training process of the network model, the training of the network adopts a sequence cycle mode for iterative training, in particular to a method that a shared deep feature extraction sub-network and a sequence specific deep feature classification sub-network use sequence batch samples corresponding to the sequence specific feature classification sub-network one by one in each cycle,
the convergence condition of the network classification performance can be observed by firstly setting a certain number of circulation times, the threshold value of the circulation times is increased when the convergence requirement is not met, otherwise, the iteration times are reduced for avoiding the over-fitting problem of the over-depth network.
3. The method for comprehensive target tracking based on the depth feature fusion convolutional neural network of claim 2, wherein in step 4,
aiming at the target tracking of a new video sequence, a brand-new sequence-specific depth feature classification sub-network is constructed and connected with a trained shared depth feature extraction sub-network to form a classification prediction network model used in the tracking process,
a sequence-specific regression prediction sub-network module is also constructed for the new video sequence.
4. The method according to claim 3, wherein in step 6,
and respectively constructing an initial target feature template, a previous frame of target feature template and a historical target feature template, setting the historical feature template set to be empty before target tracking is carried out, and setting the previous frame of target feature template as the initial target feature template.
5. The method according to claim 4, wherein in step 7,
candidate target regions are generated by utilizing a Gaussian function, then depth features are extracted from the generated candidate target regions by utilizing a shared depth feature extraction sub-network of a network model, and classification probabilities of the candidate target regions belonging to a foreground class and a background class are further calculated by utilizing a sequence-specific depth feature classification sub-network module.
6. The comprehensive target tracking method based on the depth feature fusion convolutional neural network of claim 5, wherein in step (9),
when the appearance of the target changes to a large extent, a new short-term classification prediction sub-network module is constructed and combined with the long-term classification prediction sub-network module and the multi-template matching generation module to calculate a comprehensive prediction value,
and otherwise the long-term classification prediction sub-network module is combined with the regression prediction sub-network module to calculate the comprehensive prediction value.
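
Illustrative only, not part of the claims: a minimal Python sketch of the two fusion branches in claim 6. The weighted averaging and the weight values are assumptions; the claim does not disclose the exact fusion rule.

def comprehensive_score(appearance_changed, long_term_score, short_term_score=None,
                        template_match_score=None, regression_score=None,
                        weights=(0.4, 0.3, 0.3)):
    if appearance_changed:
        # Large appearance change: fuse long-term classifier, newly built
        # short-term classifier and multi-template matching scores.
        w1, w2, w3 = weights
        return w1 * long_term_score + w2 * short_term_score + w3 * template_match_score
    # Otherwise fuse the long-term classifier with the regression sub-network.
    return 0.5 * long_term_score + 0.5 * regression_score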
CN201811467752.3A 2018-12-03 2018-12-03 Comprehensive target tracking method based on depth feature fusion convolutional neural network Expired - Fee Related CN109671102B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811467752.3A CN109671102B (en) 2018-12-03 2018-12-03 Comprehensive target tracking method based on depth feature fusion convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811467752.3A CN109671102B (en) 2018-12-03 2018-12-03 Comprehensive target tracking method based on depth feature fusion convolutional neural network

Publications (2)

Publication Number Publication Date
CN109671102A CN109671102A (en) 2019-04-23
CN109671102B true CN109671102B (en) 2021-02-05

Family

ID=66145045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811467752.3A Expired - Fee Related CN109671102B (en) 2018-12-03 2018-12-03 Comprehensive target tracking method based on depth feature fusion convolutional neural network

Country Status (1)

Country Link
CN (1) CN109671102B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110147836B (en) * 2019-05-13 2021-07-02 腾讯科技(深圳)有限公司 Model training method, device, terminal and storage medium
US11158055B2 (en) * 2019-07-26 2021-10-26 Adobe Inc. Utilizing a neural network having a two-stream encoder architecture to generate composite digital images
CN112419362B (en) * 2019-08-21 2023-07-07 中国人民解放军火箭军工程大学 Moving target tracking method based on priori information feature learning
CN110929848B (en) * 2019-11-18 2023-03-31 安徽大学 Training and tracking method based on multi-challenge perception learning model
WO2021142741A1 (en) * 2020-01-17 2021-07-22 深圳大学 Target tracking method and apparatus, and terminal device
CN111612816B (en) * 2020-04-30 2023-10-31 中国移动通信集团江苏有限公司 Method, device, equipment and computer storage medium for tracking moving target
CN111652181B (en) * 2020-06-17 2023-11-17 腾讯科技(深圳)有限公司 Target tracking method and device and electronic equipment
CN111931643A (en) * 2020-08-08 2020-11-13 商汤集团有限公司 Target detection method and device, electronic equipment and storage medium
CN111815681A (en) * 2020-09-04 2020-10-23 中国科学院自动化研究所 Target tracking method based on deep learning and discriminant model training and memory
CN112288776B (en) * 2020-10-26 2022-06-24 杭州电子科技大学 Target tracking method based on multi-time step pyramid codec
CN112287906B (en) * 2020-12-18 2021-04-09 中汽创智科技有限公司 Template matching tracking method and system based on depth feature fusion
CN113221795B (en) * 2021-05-24 2024-05-14 大连恒锐科技股份有限公司 Method and device for extracting, fusing and comparing shoe pattern features in video

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8918854B1 (en) * 2010-07-15 2014-12-23 Proxense, Llc Proximity-based system for automatic application initialization

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102509086A (en) * 2011-11-22 2012-06-20 西安理工大学 Pedestrian object detection method based on object posture projection and multi-features fusion
CN105005760A (en) * 2015-06-11 2015-10-28 华中科技大学 Pedestrian re-identification method based on finite mixture model
CN107644430A (en) * 2017-07-27 2018-01-30 孙战里 Target following based on self-adaptive features fusion
CN108038435A (en) * 2017-12-04 2018-05-15 中山大学 A kind of feature extraction and method for tracking target based on convolutional neural networks

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A deep features based generative model for visual tracking; Ping Feng et al.; Neurocomputing; 20180508; Vol. 308; full text *
Learning regression and verification networks for long-term visual tracking; Yunhua Zhang et al.; arXiv; 20181119; full text *
Target tracking method based on a localization-classification-matching model; Liu Daqian et al.; Acta Optica Sinica; 20181130; Vol. 38, No. 11; full text *

Also Published As

Publication number Publication date
CN109671102A (en) 2019-04-23

Similar Documents

Publication Publication Date Title
CN109671102B (en) Comprehensive target tracking method based on depth feature fusion convolutional neural network
CN109919031B (en) Human behavior recognition method based on deep neural network
CN110569793B (en) Target tracking method for unsupervised similarity discrimination learning
CN113378632B (en) Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method
CN111476302B (en) fast-RCNN target object detection method based on deep reinforcement learning
Lin et al. Bsn: Boundary sensitive network for temporal action proposal generation
CN107330074B (en) Image retrieval method based on deep learning and Hash coding
CN107145862B (en) Multi-feature matching multi-target tracking method based on Hough forest
CN111914644A (en) Dual-mode cooperation based weak supervision time sequence action positioning method and system
CN107633226B (en) Human body motion tracking feature processing method
US11640714B2 (en) Video panoptic segmentation
CN109410238B (en) Wolfberry identification and counting method based on PointNet + + network
CN111968133A (en) Three-dimensional point cloud data example segmentation method and system in automatic driving scene
CN107169117B (en) Hand-drawn human motion retrieval method based on automatic encoder and DTW
CN108038515A (en) Unsupervised multi-target detection tracking and its storage device and camera device
CN112766170B (en) Self-adaptive segmentation detection method and device based on cluster unmanned aerial vehicle image
CN113822368B (en) Anchor-free incremental target detection method
CN115147632A (en) Image category automatic labeling method and device based on density peak value clustering algorithm
CN110737788B (en) Rapid three-dimensional model index establishing and retrieving method
CN112613428A (en) Resnet-3D convolution cattle video target detection method based on balance loss
Jin et al. Cvt-assd: convolutional vision-transformer based attentive single shot multibox detector
CN111652177A (en) Signal feature extraction method based on deep learning
Firouznia et al. Adaptive chaotic sampling particle filter to handle occlusion and fast motion in visual object tracking
CN116206201A (en) Monitoring target detection and identification method, device, equipment and storage medium
CN117011219A (en) Method, apparatus, device, storage medium and program product for detecting quality of article

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210205

Termination date: 20211203