CN114399531A - Unsupervised target dense tracking method based on video coloring - Google Patents

Unsupervised target dense tracking method based on video coloring

Info

Publication number
CN114399531A
CN114399531A (application CN202111609449.4A)
Authority
CN
China
Prior art keywords
feature
reference frame
feature map
taking
frame group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111609449.4A
Other languages
Chinese (zh)
Inventor
杜森
宋爱波
方效林
袁庆丰
杨明
朱同鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Chuangsiqi Technology Co ltd
Original Assignee
Nanjing Chuangsiqi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Chuangsiqi Technology Co ltd filed Critical Nanjing Chuangsiqi Technology Co ltd
Priority to CN202111609449.4A
Publication of CN114399531A
Legal status: Pending (current)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an unsupervised dense target tracking method based on video coloring. A target tracking model is constructed from video sample frames that are arranged in time order and each contain a preset target object, and the target tracking model is then applied to track the preset target object. The target tracking model comprises a feature extraction network, a dynamic adjustment module and a target prediction module: the feature extraction network obtains the feature map corresponding to each video sample frame; the dynamic adjustment module outputs a feature map group formed from a preset number of the feature maps obtained so far; and the target prediction module predicts the position parameter and the area parameter of the preset target object in the feature map of the frame following the feature map output in real time by the feature extraction network.

Description

Unsupervised target dense tracking method based on video coloring
Technical Field
The invention relates to a target dense tracking method, in particular to an unsupervised target dense tracking method based on video coloring.
Background
The problem of target tracking is widely studied in computer vision. Existing target tracking usually requires a large number of image labels, and the types of targets that can be tracked are limited, so the problem of unsupervised target tracking has attracted extensive attention and intensive research in academia. Existing solutions to the unsupervised target tracking problem fall into four main categories: methods based on correspondence flow, methods based on temporal cycle consistency, methods based on multiple-grid prediction filters, and methods based on video coloring. The invention is based on the video coloring approach, whose general flow is as follows: for a given frame of a video, traverse each pixel point p_i of that frame, find the point set {q_j} most similar to p_i within a specific region of a reference frame (or set of reference frames), take the degree of similarity between p_i and each point of the set as the weight w_ij, and form the weighted sum of the label values of the points in the set,

ŷ_i = Σ_j w_ij · y_{q_j},

which gives the predicted label value ŷ_i of each pixel point of the frame. The prediction accuracy of such methods generally does not exceed 40% (F&J-mean index); with the improvements of MAST the accuracy reaches about 65%, a qualitative improvement in tracking quality and the highest level among algorithms of this type, but a gap to supervised target tracking methods remains; for example, the accuracy of PReMVOS can reach 75% to 80%. Therefore, unsupervised target tracking methods still have both the need and the room for improvement.
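As a concrete illustration of this propagation step, the following sketch, written in Python for this description, propagates labels from a reference frame to a query frame by similarity-weighted summation within a local window; the feature dimensionality, the window radius and the softmax weighting are illustrative assumptions rather than values fixed by the method.

import numpy as np

def propagate_labels(ref_feat, ref_labels, qry_feat, radius=6, temperature=1.0):
    """Propagate per-pixel labels from a reference frame to a query frame.

    ref_feat, qry_feat: float arrays of shape (H, W, C), per-pixel feature vectors.
    ref_labels: array of shape (H, W, K), one-hot (or soft) label values in the reference.
    Returns an (H, W, K) array of predicted label values for the query frame.
    """
    H, W, C = qry_feat.shape
    K = ref_labels.shape[2]
    pred = np.zeros((H, W, K), dtype=np.float64)
    for i in range(H):
        for j in range(W):
            # restrict the search to a local window around (i, j) in the reference frame
            y0, y1 = max(0, i - radius), min(H, i + radius + 1)
            x0, x1 = max(0, j - radius), min(W, j + radius + 1)
            cand_feat = ref_feat[y0:y1, x0:x1].reshape(-1, C)   # candidate points q_j
            cand_lab = ref_labels[y0:y1, x0:x1].reshape(-1, K)
            # similarity of the query pixel p_i to each candidate, turned into weights w_ij
            sim = cand_feat @ qry_feat[i, j] / temperature
            w = np.exp(sim - sim.max())
            w /= w.sum()
            # weighted sum of the candidate label values gives the predicted label
            pred[i, j] = w @ cand_lab
    return pred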
Disclosure of Invention
The purpose of the invention is as follows: to provide an unsupervised dense target tracking method based on video coloring that achieves unsupervised dense tracking of a preset target object with high prediction accuracy and high tracking quality.
In order to realize this function, the invention designs an unsupervised dense target tracking method based on video coloring: a target tracking model is obtained according to the following steps S1-S5, and the target tracking model is then applied to complete the tracking of the preset target object;
S1: obtain video sample frames that are arranged in time order and each contain a preset target object;
S2: based on a convolutional neural network and an SRM module, construct a feature extraction network that takes a video sample frame as input and outputs the feature map corresponding to that video sample frame, where the first period at which the feature extraction network outputs feature maps is T1;
S3: based on the feature extraction network's real-time output of feature maps, construct a dynamic adjustment module that takes the feature maps output in real time by the feature extraction network as real-time input and outputs, at a second period, a feature map group formed from a preset number of the feature maps obtained so far, where the second period T2 at which the dynamic adjustment module outputs feature map groups satisfies T2 > T1;
S4: construct a reference frame group based on the feature map output in real time by the feature extraction network and the feature map group output by the dynamic adjustment module, and construct a target prediction module that takes the reference frame group as input and outputs the position parameter and the area parameter of the preset target object in the feature map of the frame following the feature map output in real time by the feature extraction network;
S5: based on the feature extraction network, the dynamic adjustment module and the target prediction module, construct a target tracking model that takes the video sample frames as input and outputs the position parameter and the area parameter of the preset target object in the feature map of the frame following the feature map output in real time by the feature extraction network.
As a preferred technical scheme of the invention: feature extraction network, dynamic tuningThe modules and the target prediction module are sequentially connected in series, and the feature extraction network adopts a first period T1Outputting the characteristic diagram to a dynamic adjustment module in real time, wherein the dynamic adjustment module uses a first period T1Receiving the characteristic diagram in real time and taking a second period T2And outputting a feature map group formed by a preset number of feature maps in all the obtained feature maps to a target prediction module, constructing a reference frame group by using the feature map group output by the dynamic adjustment module, and taking the reference frame group as the input of the target prediction module, wherein the reference frame group comprises the latest frame feature map output by the feature extraction network received by the dynamic adjustment module in real time.
As a preferred technical scheme of the invention: the feature extraction network, the dynamic adjustment module and the target prediction module are connected with each other in pairs, and the feature extraction network adopts a first period T1Simultaneously outputting a characteristic diagram to a dynamic adjusting module and a target predicting module in real time, wherein the dynamic adjusting module uses T1Receiving the characteristic diagram in real time and taking a second period T2Outputting a feature map group consisting of a preset number of feature maps in all the obtained feature maps to a target prediction module, and extracting features from a network in a first period T1The latest frame characteristic diagram output to the target prediction module in real time, and the second period T of the dynamic adjustment module2And the feature map groups output to the target prediction module together construct a reference frame group as the input of the target prediction module.
As a preferred technical scheme of the invention: in step S2, the SRM module is combined with each residual block of the convolutional neural network to adjust the weight of each channel of the feature extraction network output feature map, which specifically includes the following steps:
S21: the convolutional neural network takes a video sample frame as input and outputs a feature map of dimension C × H × W, where C is the number of channels of the feature map, H is the height of the feature map, and W is the width of the feature map; the standard deviation, mean, maximum and entropy of each channel of the feature map are computed, giving a matrix of dimension C × 4;
S22: apply a 1 × 1 convolution to the C × 4 matrix to obtain a C × 1 weight vector, adjust the weight of each channel of the feature map according to this weight vector, and take the adjusted feature map as the output of the feature extraction network.
As a preferred technical scheme of the invention: the reference frame group includes a long-term reference frame group and a short-term reference frame group, and step S4 specifically includes the following steps:
S41: take the feature map of the frame following the feature map output in real time by the feature extraction network as the query frame, its frame number being t. In each feature map, each pixel point belonging to the preset target object corresponds to one label. Compare the number of labels in the feature map output in real time by the feature extraction network with the number of labels in the previous frame's feature map: if the rate of change of the number of labels is greater than or equal to a preset value, select the feature maps of frames t-1, t-2 and t-3 to construct the short-term reference frame group; if the rate of change of the number of labels is less than the preset value, select the feature maps of frames t-1, t-3 and t-5 to construct the short-term reference frame group;
when the short-term reference frame group consists of the feature maps of frames t-1, t-2 and t-3, a preset number of feature maps earlier than frame t-3, taken backward along the time axis, serve as candidate long-term reference frames;
when the short-term reference frame group consists of the feature maps of frames t-1, t-3 and t-5, a preset number of feature maps earlier than frame t-5, taken backward along the time axis, serve as candidate long-term reference frames;
S42: take the pixel points at preset positions in the query frame as query pixel points, search among the candidate pixel points of each candidate long-term reference frame for pixel points belonging to the preset target object, and mark the corresponding query pixel points as belonging to the preset target object with the same label; the result obtained in this process is the long-term memory;
The candidate pixel points are constructed based on an expansion rate dil:

[expansion rate formula reproduced only as an image in the original; dil is computed from the offset between the centroids C_{t-1} and C_r relative to the feature map height H and width W]

where C_{t-1} is the centroid of the feature map of the frame preceding the query frame, C_r is the centroid of the reference frame, H is the height of the feature map, and W is the width of the feature map; the pixel points within the range corresponding to the expansion rate are the candidate pixel points;
S43: with the coordinates in each short-term reference frame corresponding to each query pixel point in the query frame as centers, divide square regions with a side length of 15 pixel points; search within each square region for pixel points belonging to the preset target object, and mark the corresponding query pixel points as belonging to the preset target object with the same label; the result obtained in this process is the short-term memory;
S44: select the reference frame group to be input to the target prediction module based on the coincidence-degree parameters IoU and ratio of the long-term memory and the short-term memory, where IoU and ratio are:

IoU = |l_m ∩ s_m| / |l_m ∪ s_m|

ratio = |l_m| / |s_m|

where l_m is the set of pixel points marked in the long-term memory as belonging to the preset target object, and s_m is the set of pixel points marked in the short-term memory as belonging to the preset target object;
Taking the candidate long-term reference frames whose ratio value is greater than 0.9 and less than 1.05 as the long-term reference frame group, the reference frame group is constructed as follows:
(1) when IoU > 0.9, construct the reference frame group from the long-term reference frame group and the feature map of frame t-3;
(2) when 0.6 ≤ IoU ≤ 0.9, construct the reference frame group from the long-term reference frame group and the short-term reference frame group;
(3) when IoU < 0.6, reselect the long-term reference frame group according to a preset rule, and construct the reference frame group from the reselected long-term reference frame group and the short-term reference frame group.
As a preferred technical scheme of the invention: the preset value of the rate of change in the number of tags is 10%.
As a preferred technical scheme of the invention: after the long-term reference frame group is newly selected in step S44, the foreground and background segmentation is performed on each long-term reference frame according to the long-term memory and the short-term memory based on the grabcut method.
Advantageous effects: compared with the prior art, the invention has the following advantages:
The invention designs an unsupervised dense target tracking method based on video coloring. The SRM module is integrated into the residual modules of the feature extraction network and the feature extraction network is retrained, which enhances the network's ability to extract features. A dynamic reference frame adjustment mechanism and a foreground-background segmentation mechanism are then combined: a suitable reference frame is selected according to the values of the relevant parameters and the labels are propagated into the query frame, which adapts to tracking scenes in which the target changes drastically and reduces the scattering of labels onto the background. Overall, the model can improve the accuracy of target tracking in various scenarios.
Drawings
Fig. 1 is a schematic diagram of an SRM module provided according to an embodiment of the invention;
fig. 2 is a schematic diagram of a residual module combined with an SRM module according to an embodiment of the invention;
FIG. 3 is a flow chart of a mechanism for dynamically adjusting reference frames according to an embodiment of the invention;
fig. 4 is a schematic diagram of a foreground and background segmentation mechanism based on the grabcut method according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
The invention provides an unsupervised dense target tracking method based on video coloring: a target tracking model is obtained according to the following steps S1-S5, and the target tracking model is then applied to complete the tracking of a preset target object;
S1: obtain video sample frames that are arranged in time order and each contain a preset target object;
S2: based on a convolutional neural network and an SRM module, construct a feature extraction network that takes a video sample frame as input and outputs the feature map corresponding to that video sample frame, where the first period at which the feature extraction network outputs feature maps is T1.
Referring to fig. 1 and 2, the SRM module is combined with each residual block of the convolutional neural network to adjust the weights of the channels of the feature map output by the feature extraction network. The convolutional neural network may use ResNet-18 or ResNet-50. The specific steps are as follows:
S21: the original convolutional part of the convolutional neural network is left unchanged. The convolutional neural network takes a video sample frame as input and outputs a feature map of dimension C × H × W, where C is the number of channels of the feature map, H is the height of the feature map, and W is the width of the feature map. The standard deviation, mean, maximum and entropy of each channel of the feature map are computed, giving a matrix of dimension C × 4, called the style matrix. The entropy is computed channel by channel over the feature map according to:

H(X) = −Σ_{x∈χ} p(x) log p(x)

where χ is a set of intervals consisting of 20 small intervals covering the range −1 to 1 with step length 0.1, and p(x) is the probability that a feature value in X falls within the interval x, with 0 < p(x) < 1. The functions computing the standard deviation, mean and maximum are differentiable, while the function computing the entropy is not, so the derivative of the entropy has to be defined by hand:

[derivative definition reproduced only as an image in the original]

Since p(x) is a discrete function, it is heuristically set that the feature vector X is uniformly distributed, [relation reproduced only as an image in the original]; taking p(x) = 0.05 and p'(x) = 0.025, the derivative of the entropy function is given by:

[equation reproduced only as an image in the original]
S22: apply a 1 × 1 convolution to the C × 4 style matrix to obtain a C × 1 weight vector, adjust the weight of each channel of the feature map according to this weight vector, and take the adjusted feature map as the output of the feature extraction network.
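The following sketch, written in PyTorch for this description, illustrates steps S21 and S22; the sigmoid gating, the shared 1 × 1 convolution over the four statistics and the 20-bin histogram entropy computed without the custom derivative described above are assumptions made for illustration.

import torch
import torch.nn as nn

class SRMRecalibration(nn.Module):
    """Per-channel style statistics (std, mean, max, entropy) -> 1x1 conv -> channel weights."""

    def __init__(self, num_bins=20):
        super().__init__()
        # "1 x 1 convolution on the C x 4 matrix to obtain a C x 1 weight vector":
        # treat the 4 statistics as input channels of a 1x1 Conv1d over the C positions.
        self.fc = nn.Conv1d(in_channels=4, out_channels=1, kernel_size=1)
        self.num_bins = num_bins

    def channel_entropy(self, x):
        # x: (B, C, H*W); histogram each channel over 20 bins in [-1, 1]
        B, C, N = x.shape
        ent = x.new_zeros(B, C)
        with torch.no_grad():  # the custom derivative described above is not reproduced here
            for b in range(B):
                for c in range(C):
                    p = torch.histc(x[b, c], bins=self.num_bins, min=-1.0, max=1.0) / N
                    nz = p[p > 0]
                    ent[b, c] = -(nz * nz.log()).sum()
        return ent

    def forward(self, feat):
        # feat: (B, C, H, W) feature map from a residual block
        B, C, H, W = feat.shape
        flat = feat.reshape(B, C, H * W)
        style = torch.stack(
            [flat.std(dim=2), flat.mean(dim=2), flat.amax(dim=2), self.channel_entropy(flat)],
            dim=2,                                              # style matrix of shape (B, C, 4)
        )
        weights = torch.sigmoid(self.fc(style.transpose(1, 2)))  # (B, 1, C)
        weights = weights.transpose(1, 2).unsqueeze(-1)           # (B, C, 1, 1)
        return feat * weights                                     # re-weighted feature map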
S3: based on the feature extraction network's real-time output of feature maps, construct a dynamic adjustment module that takes the feature maps output in real time by the feature extraction network as real-time input and outputs, at a second period, a feature map group formed from a preset number of the feature maps obtained so far, where the second period T2 at which the dynamic adjustment module outputs feature map groups satisfies T2 > T1.
In one embodiment, the feature extraction network, the dynamic adjustment module and the target prediction module are connected in series in sequence. The feature extraction network outputs feature maps to the dynamic adjustment module in real time at the first period T1; the dynamic adjustment module receives the feature maps in real time at the first period T1 and outputs, at the second period T2, a feature map group formed from a preset number of the feature maps obtained so far to the target prediction module. A reference frame group is constructed from the feature map group output by the dynamic adjustment module and used as the input of the target prediction module, where the reference frame group includes the latest frame feature map output by the feature extraction network and received in real time by the dynamic adjustment module.
In another embodiment, the feature extraction network, the dynamic adjustment module and the target prediction module are pairwise connected to one another. The feature extraction network outputs feature maps in real time at the first period T1 to both the dynamic adjustment module and the target prediction module; the dynamic adjustment module receives the feature maps in real time at T1 and outputs, at the second period T2, a feature map group formed from a preset number of the feature maps obtained so far to the target prediction module. The latest frame feature map output in real time by the feature extraction network to the target prediction module at the first period T1, together with the feature map group output by the dynamic adjustment module to the target prediction module at the second period T2, constitutes the reference frame group that serves as the input of the target prediction module.
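Both wirings reduce to the following schematic loop, written in Python for this description: feature maps are produced every frame (first period T1), while the reference feature map group is refreshed only every few frames (second period T2). The module interfaces, the buffer size and the ratio between the two periods are placeholders rather than values prescribed by the invention.

from collections import deque

def track(frames, feature_net, adjust_module, predictor, t2_every=5, group_size=3):
    """Schematic tracking loop; feature_net, adjust_module and predictor are
    placeholder callables standing in for the three modules of the model."""
    history = deque(maxlen=64)      # feature maps obtained so far
    group = []                      # feature map group, refreshed at the second period T2
    predictions = []
    for t, frame in enumerate(frames):
        fmap = feature_net(frame)               # first period T1: one feature map per frame
        history.append(fmap)
        if t % t2_every == 0:                   # second period T2 > T1
            group = adjust_module(list(history), group_size)
        reference_group = group + [fmap]        # the group plus the latest feature map
        predictions.append(predictor(reference_group))   # position and area in frame t+1
    return predictions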
S4: construct a reference frame group based on the feature map output in real time by the feature extraction network and the feature map group output by the dynamic adjustment module, and construct a target prediction module that takes the reference frame group as input and outputs the position parameter and the area parameter of the preset target object in the feature map of the frame following the feature map output in real time by the feature extraction network;
the reference frame group includes a long-term reference frame group and a short-term reference frame group, and step S4 specifically includes the following steps:
S41: take the feature map of the frame following the feature map output in real time by the feature extraction network as the query frame, its frame number being t. In each feature map, each pixel point belonging to the preset target object corresponds to one label. Compare the number of labels in the feature map output in real time by the feature extraction network with the number of labels in the previous frame's feature map: if the rate of change of the number of labels is greater than or equal to a preset value, select the feature maps of frames t-1, t-2 and t-3 to construct the short-term reference frame group; if the rate of change of the number of labels is less than the preset value, select the feature maps of frames t-1, t-3 and t-5 to construct the short-term reference frame group;
when the short-term reference frame group consists of the feature maps of frames t-1, t-2 and t-3, a preset number of feature maps earlier than frame t-3, taken backward along the time axis, serve as candidate long-term reference frames;
when the short-term reference frame group consists of the feature maps of frames t-1, t-3 and t-5, a preset number of feature maps earlier than frame t-5, taken backward along the time axis, serve as candidate long-term reference frames;
S42: take the pixel points at preset positions in the query frame as query pixel points, search among the candidate pixel points of each candidate long-term reference frame for pixel points belonging to the preset target object, and mark the corresponding query pixel points as belonging to the preset target object with the same label; the result obtained in this process is the long-term memory;
The position of the preset target object is represented by its centroid, which mainly influences the expansion rate used for the long-term memory. The expansion rate is the spacing between the horizontal and vertical coordinates of adjacent candidate pixel points: the larger the spacing, the larger the receptive field, and at a value of 4 the candidate points essentially cover the whole reference frame. The farther the position of the preset target object in the reference frame deviates from its position near the query frame, the larger the expansion rate. The candidate pixel points are constructed based on the expansion rate dil:

[expansion rate formula reproduced only as an image in the original; dil grows with the offset between the centroids C_{t-1} and C_r relative to the feature map height H and width W]

where C_{t-1} is the centroid of the preset target object in the feature map of the frame preceding the query frame, C_r is the centroid of the preset target object in the reference frame, H is the height of the feature map, and W is the width of the feature map; the pixel points within the range corresponding to the expansion rate are the candidate pixel points;
S43: with the coordinates in each short-term reference frame corresponding to each query pixel point in the query frame as centers, divide square regions with a side length of 15 pixel points; search within each square region for pixel points belonging to the preset target object, and mark the corresponding query pixel points as belonging to the preset target object with the same label; the result obtained in this process is the short-term memory;
in the experimental process, C in the result is predicted according to long-term memoryt-1And CrCloser in position relative to the true center of mass, since the target is from CrMove to Ct-1When the label is scattered on the background, if dil is 0, the label is not different from the prediction process of the short-term memory, so dil is 1 at the minimum, and then the problem of centroid shift is considered, so that 2 ≦ dil ≦ 4 is set in one embodiment, and the specific value is determined by the ratio of the centroid shift to the maximum picture offset.
S44: the degree of similarity between a long-term reference frame and the query frame is measured by the degree of coincidence between the long-term memory and the short-term memory. Ideally the long-term and short-term memory predictions should be exactly consistent, but because of displacement, deformation and the like of the preset target object, the difference between the long-term memory result and the short-term memory result is often large, so selecting a suitable long-term reference frame helps improve the tracking quality. The reference frame group to be input to the target prediction module is selected based on the coincidence-degree parameters IoU and ratio of the long-term memory and the short-term memory, where IoU and ratio are:

IoU = |l_m ∩ s_m| / |l_m ∪ s_m|

ratio = |l_m| / |s_m|

where l_m is the set of pixel points marked in the long-term memory as belonging to the preset target object, and s_m is the set of pixel points marked in the short-term memory as belonging to the preset target object;
With the candidate long-term reference frames whose ratio value is greater than 0.9 and less than 1.05 as the long-term reference frame group, and referring to fig. 3, the reference frame group is constructed as follows:
(1) when IoU > 0.9, construct the reference frame group from the long-term reference frame group and the feature map of frame t-3;
(2) when 0.6 ≤ IoU ≤ 0.9, construct the reference frame group from the long-term reference frame group and the short-term reference frame group;
(3) when IoU < 0.6, reselect the long-term reference frame group according to a preset rule, and construct the reference frame group from the reselected long-term reference frame group and the short-term reference frame group. In one embodiment, if IoU < 0.6, the frames are traversed starting from frame 0 to find a reference frame with a higher IoU value while keeping IoU close to ratio (ideally the two should be equal and infinitely close to 1.0). If a reference frame meeting these conditions is found, the long-term reference frame is reset to it and kept unchanged in the next predictions, and the short-term reference frames are added to make the final prediction; if no reference frame meeting the conditions is found, the target has changed greatly, so frames as similar as possible are used, finally determined to be frames t-1 and t-3;
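A compact sketch of the selection logic of steps S41 and S44, written in Python for this description, is given below; the set representation of the memories and the helper names are assumptions, and the IoU and ratio definitions follow the reconstruction given above.

def short_term_indices(t, label_change_rate, threshold=0.10):
    """S41: pick the short-term reference frames from the label change rate."""
    return [t - 1, t - 2, t - 3] if label_change_rate >= threshold else [t - 1, t - 3, t - 5]

def iou_and_ratio(long_mem, short_mem):
    """long_mem / short_mem: sets of pixel coordinates marked as the target object."""
    inter = len(long_mem & short_mem)
    union = len(long_mem | short_mem)
    iou = inter / union if union else 0.0
    ratio = len(long_mem) / len(short_mem) if short_mem else float("inf")
    return iou, ratio

def filter_long_term(candidates, ratios):
    """Keep candidate long-term reference frames whose ratio lies in (0.9, 1.05)."""
    return [f for f, r in zip(candidates, ratios) if 0.9 < r < 1.05]

def build_reference_group(iou, long_group, short_group, frame_t3, reselect):
    """S44: assemble the reference frame group from the IoU between the memories."""
    if iou > 0.9:
        return long_group + [frame_t3]
    if iou >= 0.6:
        return long_group + short_group
    # IoU < 0.6: reselect the long-term reference frames (traverse from frame 0) and retry
    return reselect() + short_group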
After the long-term reference frame group is reselected, foreground-background segmentation is performed on each long-term reference frame according to the long-term memory and the short-term memory based on the grabcut method.
When the dynamic adjustment mechanism reselects the long-term reference frame, the selected result may have errors accumulated in the prediction process of the previous stage, which may cause the degradation of prediction quality, and even misjudge the background obviously not belonging to the preset target object as the preset target object. Therefore, after the dynamic adjustment mechanism reselects the long-term reference frame, foreground and background segmentation is required to be performed, labels which obviously do not belong to the preset target object are removed, and meanwhile, the predicted result can be smoothed, so that the edge of the predicted result is more adaptive to the preset target object.
The method adopts the grabcut method to segment the foreground and background. The grabcut method requires initial labels to be set, namely background, foreground, possible background and possible foreground, represented by 0, 1, 2 and 3 respectively. While the method runs, a color model of the foreground and background is generated from the initially labeled foreground and background, and the color model is then used to decide whether each pixel point labeled as possible background or possible foreground is foreground or background.
Referring to fig. 4, the label initialization is as follows: the part where the long-term memory and the short-term memory overlap as foreground is labeled foreground; the part where they overlap as background is labeled background; the part that is foreground in the long-term memory but background in the short-term memory is labeled possible background; and the part that is foreground in the short-term memory but background in the long-term memory is labeled possible background.
Considering the deformation and displacement of the preset target object, pixels near the foreground and the possible foreground, although labeled as background, may still belong to the preset target object. The pixels labeled foreground and possible foreground are therefore grown: with each such pixel as a center point, the pixels within a square region of side length 15 that are labeled possible background are relabeled possible foreground. After the label initialization for the grabcut method is completed, the grabcut method is run, and its result directly replaces the original prediction result.
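The following sketch, using OpenCV's grabCut in Python and written for this description, applies the label initialization and growth scheme described above to a reselected long-term reference frame; the boolean-mask inputs and the fixed iteration count are illustrative assumptions.

import cv2
import numpy as np

def segment_reference_frame(image, long_mem, short_mem, grow=15, iters=5):
    """Foreground/background refinement of a long-term reference frame with grabCut.

    image: H x W x 3 uint8 frame; long_mem, short_mem: H x W boolean masks of pixels
    marked as the target object by the long-term and short-term memory.
    """
    mask = np.full(image.shape[:2], cv2.GC_PR_BGD, dtype=np.uint8)  # default: possible background
    mask[long_mem & short_mem] = cv2.GC_FGD                         # agreement on foreground
    mask[~long_mem & ~short_mem] = cv2.GC_BGD                       # agreement on background
    # disagreements between the two memories stay labeled possible background (GC_PR_BGD)

    # grow the (possible) foreground: possible-background pixels inside a 15 x 15
    # square around any foreground pixel become possible foreground
    fg = ((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)).astype(np.uint8)
    kernel = np.ones((grow, grow), np.uint8)
    grown = cv2.dilate(fg, kernel) > 0
    mask[grown & (mask == cv2.GC_PR_BGD)] = cv2.GC_PR_FGD

    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image, mask, None, bgd_model, fgd_model, iters, cv2.GC_INIT_WITH_MASK)
    # the refined result directly replaces the original prediction
    return (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)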
And S5, based on the feature extraction network, the dynamic adjustment module and the target prediction module, taking the video sample frame as input, and taking the position parameter and the area parameter of the preset target object in the feature map of the next frame of the feature map output in real time in the feature extraction network as output to construct a target tracking model. In one embodiment, the area of the preset target object is represented by the number of tags belonging to the preset target object in the query frame, and the position of the preset target object is represented by the centroid of the preset target object in the query frame.
In one embodiment, the preset value of the rate of change of the number of labels is 10%.
The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.

Claims (7)

1. An unsupervised dense target tracking method based on video coloring, characterized in that a target tracking model is obtained according to the following steps S1-S5, and the target tracking model is then applied to complete the tracking of a preset target object;
S1: obtain video sample frames that are arranged in time order and each contain a preset target object;
S2: based on a convolutional neural network and an SRM module, construct a feature extraction network that takes a video sample frame as input and outputs the feature map corresponding to that video sample frame, where the first period at which the feature extraction network outputs feature maps is T1;
S3: based on the feature extraction network's real-time output of feature maps, construct a dynamic adjustment module that takes the feature maps output in real time by the feature extraction network as real-time input and outputs, at a second period, a feature map group formed from a preset number of the feature maps obtained so far, where the second period T2 at which the dynamic adjustment module outputs feature map groups satisfies T2 > T1;
S4: construct a reference frame group based on the feature map output in real time by the feature extraction network and the feature map group output by the dynamic adjustment module, and construct a target prediction module that takes the reference frame group as input and outputs the position parameter and the area parameter of the preset target object in the feature map of the frame following the feature map output in real time by the feature extraction network;
S5: based on the feature extraction network, the dynamic adjustment module and the target prediction module, construct a target tracking model that takes the video sample frames as input and outputs the position parameter and the area parameter of the preset target object in the feature map of the frame following the feature map output in real time by the feature extraction network.
2. The unsupervised dense target tracking method based on video coloring as claimed in claim 1, wherein the feature extraction network, the dynamic adjustment module and the target prediction module are connected in series in sequence; the feature extraction network outputs feature maps to the dynamic adjustment module in real time at the first period T1; the dynamic adjustment module receives the feature maps in real time at the first period T1 and outputs, at the second period T2, a feature map group formed from a preset number of the feature maps obtained so far to the target prediction module; a reference frame group is constructed from the feature map group output by the dynamic adjustment module and used as the input of the target prediction module, the reference frame group including the latest frame feature map output by the feature extraction network and received in real time by the dynamic adjustment module.
3. The unsupervised dense target tracking method based on video coloring as claimed in claim 1, wherein the feature extraction network, the dynamic adjustment module and the target prediction module are pairwise connected to one another; the feature extraction network outputs feature maps in real time at the first period T1 to both the dynamic adjustment module and the target prediction module; the dynamic adjustment module receives the feature maps in real time at T1 and outputs, at the second period T2, a feature map group formed from a preset number of the feature maps obtained so far to the target prediction module; the latest frame feature map output in real time by the feature extraction network to the target prediction module at the first period T1, together with the feature map group output by the dynamic adjustment module to the target prediction module at the second period T2, constitutes the reference frame group that serves as the input of the target prediction module.
4. The unsupervised dense target tracking method based on video coloring as claimed in claim 1, wherein in step S2 the SRM module is combined with each residual block of the convolutional neural network to adjust the weight of each channel of the feature map output by the feature extraction network, comprising the following specific steps:
S21: the convolutional neural network takes a video sample frame as input and outputs a feature map of dimension C × H × W, where C is the number of channels of the feature map, H is the height of the feature map, and W is the width of the feature map; the standard deviation, mean, maximum and entropy of each channel of the feature map are computed, giving a matrix of dimension C × 4;
S22: apply a 1 × 1 convolution to the C × 4 matrix to obtain a C × 1 weight vector, adjust the weight of each channel of the feature map according to this weight vector, and take the adjusted feature map as the output of the feature extraction network.
5. The unsupervised dense target tracking method based on video coloring as claimed in claim 1, wherein the reference frame group comprises a long-term reference frame group and a short-term reference frame group, and step S4 specifically comprises the following steps:
S41: take the feature map of the frame following the feature map output in real time by the feature extraction network as the query frame, its frame number being t. In each feature map, each pixel point belonging to the preset target object corresponds to one label. Compare the number of labels in the feature map output in real time by the feature extraction network with the number of labels in the previous frame's feature map: if the rate of change of the number of labels is greater than or equal to a preset value, select the feature maps of frames t-1, t-2 and t-3 to construct the short-term reference frame group; if the rate of change of the number of labels is less than the preset value, select the feature maps of frames t-1, t-3 and t-5 to construct the short-term reference frame group;
when the short-term reference frame group consists of the feature maps of frames t-1, t-2 and t-3, a preset number of feature maps earlier than frame t-3, taken backward along the time axis, serve as candidate long-term reference frames;
when the short-term reference frame group consists of the feature maps of frames t-1, t-3 and t-5, a preset number of feature maps earlier than frame t-5, taken backward along the time axis, serve as candidate long-term reference frames;
S42: take the pixel points at preset positions in the query frame as query pixel points, search among the candidate pixel points of each candidate long-term reference frame for pixel points belonging to the preset target object, and mark the corresponding query pixel points as belonging to the preset target object with the same label; the result obtained in this process is the long-term memory;
The candidate pixel points are constructed based on an expansion rate dil:

[expansion rate formula reproduced only as an image in the original; dil is computed from the offset between the centroids C_{t-1} and C_r relative to the feature map height H and width W]

where C_{t-1} is the centroid of the feature map of the frame preceding the query frame, C_r is the centroid of the reference frame, H is the height of the feature map, and W is the width of the feature map; the pixel points within the range corresponding to the expansion rate are the candidate pixel points;
S43: with the coordinates in each short-term reference frame corresponding to each query pixel point in the query frame as centers, divide square regions with a side length of 15 pixel points; search within each square region for pixel points belonging to the preset target object, and mark the corresponding query pixel points as belonging to the preset target object with the same label; the result obtained in this process is the short-term memory;
S44: select the reference frame group to be input to the target prediction module based on the coincidence-degree parameters IoU and ratio of the long-term memory and the short-term memory, where IoU and ratio are:

IoU = |l_m ∩ s_m| / |l_m ∪ s_m|

ratio = |l_m| / |s_m|

where l_m is the set of pixel points marked in the long-term memory as belonging to the preset target object, and s_m is the set of pixel points marked in the short-term memory as belonging to the preset target object;
Taking the candidate long-term reference frames whose ratio value is greater than 0.9 and less than 1.05 as the long-term reference frame group, the reference frame group is constructed as follows:
(1) when IoU > 0.9, construct the reference frame group from the long-term reference frame group and the feature map of frame t-3;
(2) when 0.6 ≤ IoU ≤ 0.9, construct the reference frame group from the long-term reference frame group and the short-term reference frame group;
(3) when IoU < 0.6, reselect the long-term reference frame group according to a preset rule, and construct the reference frame group from the reselected long-term reference frame group and the short-term reference frame group.
6. The unsupervised dense target tracking method based on video coloring according to claim 2, wherein the preset value of the rate of change of the number of labels is 10%.
7. The unsupervised dense target tracking method based on video coloring as claimed in claim 2, wherein after the long-term reference frame group is reselected in step S44, foreground-background segmentation is performed on each long-term reference frame according to the long-term memory and the short-term memory based on the grabcut method.
CN202111609449.4A 2021-12-24 2021-12-24 Unsupervised target dense tracking method based on video coloring Pending CN114399531A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111609449.4A CN114399531A (en) 2021-12-24 2021-12-24 Unsupervised target dense tracking method based on video coloring

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111609449.4A CN114399531A (en) 2021-12-24 2021-12-24 Unsupervised target dense tracking method based on video coloring

Publications (1)

Publication Number Publication Date
CN114399531A true CN114399531A (en) 2022-04-26

Family

ID=81226420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111609449.4A Pending CN114399531A (en) 2021-12-24 2021-12-24 Unsupervised target dense tracking method based on video coloring

Country Status (1)

Country Link
CN (1) CN114399531A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6335985B1 (en) * 1998-01-07 2002-01-01 Kabushiki Kaisha Toshiba Object extraction apparatus
US6774917B1 (en) * 1999-03-11 2004-08-10 Fuji Xerox Co., Ltd. Methods and apparatuses for interactive similarity searching, retrieval, and browsing of video
CN109086709A (en) * 2018-07-27 2018-12-25 腾讯科技(深圳)有限公司 Feature Selection Model training method, device and storage medium
CN111090756A (en) * 2020-03-24 2020-05-01 腾讯科技(深圳)有限公司 Artificial intelligence-based multi-target recommendation model training method and device
CN112950675A (en) * 2021-03-18 2021-06-11 深圳市商汤科技有限公司 Target tracking method and device, electronic equipment and storage medium
CN113344976A (en) * 2021-06-29 2021-09-03 常州工学院 Visual tracking method based on target object characterization point estimation



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination