CN114399531A - Unsupervised target dense tracking method based on video coloring - Google Patents

Unsupervised target dense tracking method based on video coloring

Info

Publication number
CN114399531A
CN114399531A (application CN202111609449.4A)
Authority
CN
China
Prior art keywords
feature
reference frame
feature map
taking
frame group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111609449.4A
Other languages
Chinese (zh)
Inventor
杜森
宋爱波
方效林
袁庆丰
杨明
朱同鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Chuangsiqi Technology Co ltd
Original Assignee
Nanjing Chuangsiqi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Chuangsiqi Technology Co ltd filed Critical Nanjing Chuangsiqi Technology Co ltd
Priority to CN202111609449.4A
Publication of CN114399531A
Legal status: Pending (current)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an unsupervised dense target tracking method based on video coloring. A target tracking model is constructed from video sample frames that are arranged in time order and each contain a preset target object, and the target tracking model is then applied to track the preset target object. The target tracking model comprises a feature extraction network, a dynamic adjustment module and a target prediction module: the feature extraction network obtains the feature map corresponding to each video sample frame; the dynamic adjustment module outputs a feature map group formed from a preset number of the feature maps obtained so far; and the target prediction module predicts the position parameter and the area parameter of the preset target object in the feature map of the frame following the feature map output in real time by the feature extraction network.

Description

Unsupervised target dense tracking method based on video coloring
Technical Field
The invention relates to a target dense tracking method, in particular to an unsupervised target dense tracking method based on video coloring.
Background
The problem of target tracking is widely studied in computer vision. Existing target tracking usually requires a large number of image labels, and the types of targets that can be tracked are limited, so the problem of unsupervised target tracking has attracted extensive attention and intensive research in academia. Existing solutions to the unsupervised target tracking problem fall into four main categories: methods based on correspondence flow, methods based on temporal cycle consistency, methods based on multiple-grid prediction filters, and methods based on video coloring. The invention is based on the video coloring approach, whose general flow is as follows: for a given frame of a video, traverse each pixel point p_i of that frame, find the point set {q_j} most similar to p_i within a specific region of a reference frame (or set of reference frames), take the degree of similarity between p_i and each point of the set as the weight w_ij, and form the weighted sum of the label values of the points in the set,

ŷ_i = Σ_j w_ij · y_{q_j},

which gives the predicted label value ŷ_i of each pixel point of the frame. The prediction accuracy of such methods generally does not exceed 40% (F&J-mean index); with the improvements of MAST the accuracy reaches about 65%, a qualitative improvement in tracking quality and the highest level among algorithms of this type, but a gap to supervised target tracking methods remains; for example, the accuracy of PReMVOS can reach 75% to 80%. Therefore, unsupervised target tracking methods still have both the need and the room for improvement.
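As a concrete illustration of this propagation step, the following sketch, written in Python for this description, propagates labels from a reference frame to a query frame by similarity-weighted summation within a local window; the feature dimensionality, the window radius and the softmax weighting are illustrative assumptions rather than values fixed by the method.

import numpy as np

def propagate_labels(ref_feat, ref_labels, qry_feat, radius=6, temperature=1.0):
    """Propagate per-pixel labels from a reference frame to a query frame.

    ref_feat, qry_feat: float arrays of shape (H, W, C), per-pixel feature vectors.
    ref_labels: array of shape (H, W, K), one-hot (or soft) label values in the reference.
    Returns an (H, W, K) array of predicted label values for the query frame.
    """
    H, W, C = qry_feat.shape
    K = ref_labels.shape[2]
    pred = np.zeros((H, W, K), dtype=np.float64)
    for i in range(H):
        for j in range(W):
            # restrict the search to a local window around (i, j) in the reference frame
            y0, y1 = max(0, i - radius), min(H, i + radius + 1)
            x0, x1 = max(0, j - radius), min(W, j + radius + 1)
            cand_feat = ref_feat[y0:y1, x0:x1].reshape(-1, C)   # candidate points q_j
            cand_lab = ref_labels[y0:y1, x0:x1].reshape(-1, K)
            # similarity of the query pixel p_i to each candidate, turned into weights w_ij
            sim = cand_feat @ qry_feat[i, j] / temperature
            w = np.exp(sim - sim.max())
            w /= w.sum()
            # weighted sum of the candidate label values gives the predicted label
            pred[i, j] = w @ cand_lab
    return pred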
Disclosure of Invention
The purpose of the invention is as follows: to provide an unsupervised dense target tracking method based on video coloring that achieves unsupervised dense tracking of a preset target object with high prediction accuracy and high tracking quality.
In order to realize this function, the invention designs an unsupervised dense target tracking method based on video coloring: a target tracking model is obtained according to the following steps S1-S5, and the target tracking model is then applied to complete the tracking of the preset target object;
S1: obtain video sample frames that are arranged in time order and each contain a preset target object;
S2: based on a convolutional neural network and an SRM module, construct a feature extraction network that takes a video sample frame as input and outputs the feature map corresponding to that video sample frame, where the first period at which the feature extraction network outputs feature maps is T1;
S3: based on the feature extraction network's real-time output of feature maps, construct a dynamic adjustment module that takes the feature maps output in real time by the feature extraction network as real-time input and outputs, at a second period, a feature map group formed from a preset number of the feature maps obtained so far, where the second period T2 at which the dynamic adjustment module outputs feature map groups satisfies T2 > T1;
S4: construct a reference frame group based on the feature map output in real time by the feature extraction network and the feature map group output by the dynamic adjustment module, and construct a target prediction module that takes the reference frame group as input and outputs the position parameter and the area parameter of the preset target object in the feature map of the frame following the feature map output in real time by the feature extraction network;
S5: based on the feature extraction network, the dynamic adjustment module and the target prediction module, construct a target tracking model that takes the video sample frames as input and outputs the position parameter and the area parameter of the preset target object in the feature map of the frame following the feature map output in real time by the feature extraction network.
As a preferred technical scheme of the invention: feature extraction network, dynamic tuningThe modules and the target prediction module are sequentially connected in series, and the feature extraction network adopts a first period T1Outputting the characteristic diagram to a dynamic adjustment module in real time, wherein the dynamic adjustment module uses a first period T1Receiving the characteristic diagram in real time and taking a second period T2And outputting a feature map group formed by a preset number of feature maps in all the obtained feature maps to a target prediction module, constructing a reference frame group by using the feature map group output by the dynamic adjustment module, and taking the reference frame group as the input of the target prediction module, wherein the reference frame group comprises the latest frame feature map output by the feature extraction network received by the dynamic adjustment module in real time.
As a preferred technical scheme of the invention: the feature extraction network, the dynamic adjustment module and the target prediction module are connected with each other in pairs, and the feature extraction network adopts a first period T1Simultaneously outputting a characteristic diagram to a dynamic adjusting module and a target predicting module in real time, wherein the dynamic adjusting module uses T1Receiving the characteristic diagram in real time and taking a second period T2Outputting a feature map group consisting of a preset number of feature maps in all the obtained feature maps to a target prediction module, and extracting features from a network in a first period T1The latest frame characteristic diagram output to the target prediction module in real time, and the second period T of the dynamic adjustment module2And the feature map groups output to the target prediction module together construct a reference frame group as the input of the target prediction module.
As a preferred technical scheme of the invention: in step S2, the SRM module is combined with each residual block of the convolutional neural network to adjust the weight of each channel of the feature extraction network output feature map, which specifically includes the following steps:
S21: the convolutional neural network takes a video sample frame as input and outputs a feature map of dimension C × H × W, where C is the number of channels of the feature map, H is the height of the feature map, and W is the width of the feature map; the standard deviation, mean, maximum and entropy of each channel of the feature map are computed, giving a matrix of dimension C × 4;
S22: apply a 1 × 1 convolution to the C × 4 matrix to obtain a C × 1 weight vector, adjust the weight of each channel of the feature map according to this weight vector, and take the adjusted feature map as the output of the feature extraction network.
As a preferred technical scheme of the invention: the reference frame group includes a long-term reference frame group and a short-term reference frame group, and step S4 specifically includes the following steps:
S41: take the feature map of the frame following the feature map output in real time by the feature extraction network as the query frame, its frame number being t. In each feature map, each pixel point belonging to the preset target object corresponds to one label. Compare the number of labels in the feature map output in real time by the feature extraction network with the number of labels in the previous frame's feature map: if the rate of change of the number of labels is greater than or equal to a preset value, select the feature maps of frames t-1, t-2 and t-3 to construct the short-term reference frame group; if the rate of change of the number of labels is less than the preset value, select the feature maps of frames t-1, t-3 and t-5 to construct the short-term reference frame group;
when the short-term reference frame group consists of the feature maps of frames t-1, t-2 and t-3, a preset number of feature maps earlier than frame t-3, taken backward along the time axis, serve as candidate long-term reference frames;
when the short-term reference frame group consists of the feature maps of frames t-1, t-3 and t-5, a preset number of feature maps earlier than frame t-5, taken backward along the time axis, serve as candidate long-term reference frames;
S42: take the pixel points at preset positions in the query frame as query pixel points, search among the candidate pixel points of each candidate long-term reference frame for pixel points belonging to the preset target object, and mark the corresponding query pixel points as belonging to the preset target object with the same label; the result obtained in this process is the long-term memory;
The candidate pixel points are constructed based on an expansion rate dil:

[expansion rate formula reproduced only as an image in the original; dil is computed from the offset between the centroids C_{t-1} and C_r relative to the feature map height H and width W]

where C_{t-1} is the centroid of the feature map of the frame preceding the query frame, C_r is the centroid of the reference frame, H is the height of the feature map, and W is the width of the feature map; the pixel points within the range corresponding to the expansion rate are the candidate pixel points;
S43: with the coordinates in each short-term reference frame corresponding to each query pixel point in the query frame as centers, divide square regions with a side length of 15 pixel points; search within each square region for pixel points belonging to the preset target object, and mark the corresponding query pixel points as belonging to the preset target object with the same label; the result obtained in this process is the short-term memory;
S44: select the reference frame group to be input to the target prediction module based on the coincidence-degree parameters IoU and ratio of the long-term memory and the short-term memory, where IoU and ratio are:

IoU = |l_m ∩ s_m| / |l_m ∪ s_m|

ratio = |l_m| / |s_m|

where l_m is the set of pixel points marked in the long-term memory as belonging to the preset target object, and s_m is the set of pixel points marked in the short-term memory as belonging to the preset target object;
Taking the candidate long-term reference frames whose ratio value is greater than 0.9 and less than 1.05 as the long-term reference frame group, the reference frame group is constructed as follows:
(1) when IoU > 0.9, construct the reference frame group from the long-term reference frame group and the feature map of frame t-3;
(2) when 0.6 ≤ IoU ≤ 0.9, construct the reference frame group from the long-term reference frame group and the short-term reference frame group;
(3) when IoU < 0.6, reselect the long-term reference frame group according to a preset rule, and construct the reference frame group from the reselected long-term reference frame group and the short-term reference frame group.
As a preferred technical scheme of the invention: the preset value of the rate of change in the number of tags is 10%.
As a preferred technical scheme of the invention: after the long-term reference frame group is newly selected in step S44, the foreground and background segmentation is performed on each long-term reference frame according to the long-term memory and the short-term memory based on the grabcut method.
Advantageous effects: compared with the prior art, the invention has the following advantages:
The invention designs an unsupervised dense target tracking method based on video coloring. The SRM module is integrated into the residual modules of the feature extraction network and the feature extraction network is retrained, which enhances the network's ability to extract features. A dynamic reference frame adjustment mechanism and a foreground-background segmentation mechanism are then combined: a suitable reference frame is selected according to the values of the relevant parameters and the labels are propagated into the query frame, which adapts to tracking scenes in which the target changes drastically and reduces the scattering of labels onto the background. Overall, the model can improve the accuracy of target tracking in various scenarios.
Drawings
Fig. 1 is a schematic diagram of an SRM module provided according to an embodiment of the invention;
fig. 2 is a schematic diagram of a residual module combined with an SRM module according to an embodiment of the invention;
FIG. 3 is a flow chart of a mechanism for dynamically adjusting reference frames according to an embodiment of the invention;
fig. 4 is a schematic diagram of a foreground and background segmentation mechanism based on the grabcut method according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
The invention provides an unsupervised dense target tracking method based on video coloring: a target tracking model is obtained according to the following steps S1-S5, and the target tracking model is then applied to complete the tracking of a preset target object;
S1: obtain video sample frames that are arranged in time order and each contain a preset target object;
S2: based on a convolutional neural network and an SRM module, construct a feature extraction network that takes a video sample frame as input and outputs the feature map corresponding to that video sample frame, where the first period at which the feature extraction network outputs feature maps is T1.
Referring to fig. 1 and 2, the SRM module is combined with each residual block of the convolutional neural network to adjust the weights of the channels of the feature map output by the feature extraction network. The convolutional neural network may use ResNet-18 or ResNet-50. The specific steps are as follows:
S21: the original convolutional part of the convolutional neural network is left unchanged. The convolutional neural network takes a video sample frame as input and outputs a feature map of dimension C × H × W, where C is the number of channels of the feature map, H is the height of the feature map, and W is the width of the feature map. The standard deviation, mean, maximum and entropy of each channel of the feature map are computed, giving a matrix of dimension C × 4, called the style matrix. The entropy is computed channel by channel over the feature map according to:

H(X) = −Σ_{x∈χ} p(x) log p(x)

where χ is a set of intervals consisting of 20 small intervals covering the range −1 to 1 with step length 0.1, and p(x) is the probability that a feature value in X falls within the interval x, with 0 < p(x) < 1. The functions computing the standard deviation, mean and maximum are differentiable, while the function computing the entropy is not, so the derivative of the entropy has to be defined by hand:

[derivative definition reproduced only as an image in the original]

Since p(x) is a discrete function, it is heuristically set that the feature vector X is uniformly distributed, [relation reproduced only as an image in the original]; taking p(x) = 0.05 and p'(x) = 0.025, the derivative of the entropy function is given by:

[equation reproduced only as an image in the original]
S22: apply a 1 × 1 convolution to the C × 4 style matrix to obtain a C × 1 weight vector, adjust the weight of each channel of the feature map according to this weight vector, and take the adjusted feature map as the output of the feature extraction network.
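The following sketch, written in PyTorch for this description, illustrates steps S21 and S22; the sigmoid gating, the shared 1 × 1 convolution over the four statistics and the 20-bin histogram entropy computed without the custom derivative described above are assumptions made for illustration.

import torch
import torch.nn as nn

class SRMRecalibration(nn.Module):
    """Per-channel style statistics (std, mean, max, entropy) -> 1x1 conv -> channel weights."""

    def __init__(self, num_bins=20):
        super().__init__()
        # "1 x 1 convolution on the C x 4 matrix to obtain a C x 1 weight vector":
        # treat the 4 statistics as input channels of a 1x1 Conv1d over the C positions.
        self.fc = nn.Conv1d(in_channels=4, out_channels=1, kernel_size=1)
        self.num_bins = num_bins

    def channel_entropy(self, x):
        # x: (B, C, H*W); histogram each channel over 20 bins in [-1, 1]
        B, C, N = x.shape
        ent = x.new_zeros(B, C)
        with torch.no_grad():  # the custom derivative described above is not reproduced here
            for b in range(B):
                for c in range(C):
                    p = torch.histc(x[b, c], bins=self.num_bins, min=-1.0, max=1.0) / N
                    nz = p[p > 0]
                    ent[b, c] = -(nz * nz.log()).sum()
        return ent

    def forward(self, feat):
        # feat: (B, C, H, W) feature map from a residual block
        B, C, H, W = feat.shape
        flat = feat.reshape(B, C, H * W)
        style = torch.stack(
            [flat.std(dim=2), flat.mean(dim=2), flat.amax(dim=2), self.channel_entropy(flat)],
            dim=2,                                              # style matrix of shape (B, C, 4)
        )
        weights = torch.sigmoid(self.fc(style.transpose(1, 2)))  # (B, 1, C)
        weights = weights.transpose(1, 2).unsqueeze(-1)           # (B, C, 1, 1)
        return feat * weights                                     # re-weighted feature map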
S3: based on the feature extraction network's real-time output of feature maps, construct a dynamic adjustment module that takes the feature maps output in real time by the feature extraction network as real-time input and outputs, at a second period, a feature map group formed from a preset number of the feature maps obtained so far, where the second period T2 at which the dynamic adjustment module outputs feature map groups satisfies T2 > T1.
In one embodiment, the feature extraction network, the dynamic adjustment module and the target prediction module are connected in series in sequence. The feature extraction network outputs feature maps to the dynamic adjustment module in real time at the first period T1; the dynamic adjustment module receives the feature maps in real time at the first period T1 and outputs, at the second period T2, a feature map group formed from a preset number of the feature maps obtained so far to the target prediction module. A reference frame group is constructed from the feature map group output by the dynamic adjustment module and used as the input of the target prediction module, where the reference frame group includes the latest frame feature map output by the feature extraction network and received in real time by the dynamic adjustment module.
In another embodiment, the feature extraction network, the dynamic adjustment module and the target prediction module are pairwise connected to one another. The feature extraction network outputs feature maps in real time at the first period T1 to both the dynamic adjustment module and the target prediction module; the dynamic adjustment module receives the feature maps in real time at T1 and outputs, at the second period T2, a feature map group formed from a preset number of the feature maps obtained so far to the target prediction module. The latest frame feature map output in real time by the feature extraction network to the target prediction module at the first period T1, together with the feature map group output by the dynamic adjustment module to the target prediction module at the second period T2, constitutes the reference frame group that serves as the input of the target prediction module.
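Both wirings reduce to the following schematic loop, written in Python for this description: feature maps are produced every frame (first period T1), while the reference feature map group is refreshed only every few frames (second period T2). The module interfaces, the buffer size and the ratio between the two periods are placeholders rather than values prescribed by the invention.

from collections import deque

def track(frames, feature_net, adjust_module, predictor, t2_every=5, group_size=3):
    """Schematic tracking loop; feature_net, adjust_module and predictor are
    placeholder callables standing in for the three modules of the model."""
    history = deque(maxlen=64)      # feature maps obtained so far
    group = []                      # feature map group, refreshed at the second period T2
    predictions = []
    for t, frame in enumerate(frames):
        fmap = feature_net(frame)               # first period T1: one feature map per frame
        history.append(fmap)
        if t % t2_every == 0:                   # second period T2 > T1
            group = adjust_module(list(history), group_size)
        reference_group = group + [fmap]        # the group plus the latest feature map
        predictions.append(predictor(reference_group))   # position and area in frame t+1
    return predictions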
S4: construct a reference frame group based on the feature map output in real time by the feature extraction network and the feature map group output by the dynamic adjustment module, and construct a target prediction module that takes the reference frame group as input and outputs the position parameter and the area parameter of the preset target object in the feature map of the frame following the feature map output in real time by the feature extraction network;
the reference frame group includes a long-term reference frame group and a short-term reference frame group, and step S4 specifically includes the following steps:
S41: take the feature map of the frame following the feature map output in real time by the feature extraction network as the query frame, its frame number being t. In each feature map, each pixel point belonging to the preset target object corresponds to one label. Compare the number of labels in the feature map output in real time by the feature extraction network with the number of labels in the previous frame's feature map: if the rate of change of the number of labels is greater than or equal to a preset value, select the feature maps of frames t-1, t-2 and t-3 to construct the short-term reference frame group; if the rate of change of the number of labels is less than the preset value, select the feature maps of frames t-1, t-3 and t-5 to construct the short-term reference frame group;
when the short-term reference frame group consists of the feature maps of frames t-1, t-2 and t-3, a preset number of feature maps earlier than frame t-3, taken backward along the time axis, serve as candidate long-term reference frames;
when the short-term reference frame group consists of the feature maps of frames t-1, t-3 and t-5, a preset number of feature maps earlier than frame t-5, taken backward along the time axis, serve as candidate long-term reference frames;
S42: take the pixel points at preset positions in the query frame as query pixel points, search among the candidate pixel points of each candidate long-term reference frame for pixel points belonging to the preset target object, and mark the corresponding query pixel points as belonging to the preset target object with the same label; the result obtained in this process is the long-term memory;
The position of the preset target object is represented by its centroid, which mainly influences the expansion rate used for the long-term memory. The expansion rate is the spacing between the horizontal and vertical coordinates of adjacent candidate pixel points: the larger the spacing, the larger the receptive field, and at a value of 4 the candidate points essentially cover the whole reference frame. The farther the position of the preset target object in the reference frame deviates from its position near the query frame, the larger the expansion rate. The candidate pixel points are constructed based on the expansion rate dil:

[expansion rate formula reproduced only as an image in the original; dil grows with the offset between the centroids C_{t-1} and C_r relative to the feature map height H and width W]

where C_{t-1} is the centroid of the preset target object in the feature map of the frame preceding the query frame, C_r is the centroid of the preset target object in the reference frame, H is the height of the feature map, and W is the width of the feature map; the pixel points within the range corresponding to the expansion rate are the candidate pixel points;
S43: with the coordinates in each short-term reference frame corresponding to each query pixel point in the query frame as centers, divide square regions with a side length of 15 pixel points; search within each square region for pixel points belonging to the preset target object, and mark the corresponding query pixel points as belonging to the preset target object with the same label; the result obtained in this process is the short-term memory;
in the experimental process, C in the result is predicted according to long-term memoryt-1And CrCloser in position relative to the true center of mass, since the target is from CrMove to Ct-1When the label is scattered on the background, if dil is 0, the label is not different from the prediction process of the short-term memory, so dil is 1 at the minimum, and then the problem of centroid shift is considered, so that 2 ≦ dil ≦ 4 is set in one embodiment, and the specific value is determined by the ratio of the centroid shift to the maximum picture offset.
S44: the degree of similarity between a long-term reference frame and the query frame is measured by the degree of coincidence between the long-term memory and the short-term memory. Ideally the long-term and short-term memory predictions should be exactly consistent, but because of displacement, deformation and the like of the preset target object, the difference between the long-term memory result and the short-term memory result is often large, so selecting a suitable long-term reference frame helps improve the tracking quality. The reference frame group to be input to the target prediction module is selected based on the coincidence-degree parameters IoU and ratio of the long-term memory and the short-term memory, where IoU and ratio are:

IoU = |l_m ∩ s_m| / |l_m ∪ s_m|

ratio = |l_m| / |s_m|

where l_m is the set of pixel points marked in the long-term memory as belonging to the preset target object, and s_m is the set of pixel points marked in the short-term memory as belonging to the preset target object;
With the candidate long-term reference frames whose ratio value is greater than 0.9 and less than 1.05 as the long-term reference frame group, and referring to fig. 3, the reference frame group is constructed as follows:
(1) when IoU > 0.9, construct the reference frame group from the long-term reference frame group and the feature map of frame t-3;
(2) when 0.6 ≤ IoU ≤ 0.9, construct the reference frame group from the long-term reference frame group and the short-term reference frame group;
(3) when IoU < 0.6, reselect the long-term reference frame group according to a preset rule, and construct the reference frame group from the reselected long-term reference frame group and the short-term reference frame group. In one embodiment, if IoU < 0.6, the frames are traversed starting from frame 0 to find a reference frame with a higher IoU value while keeping IoU close to ratio (ideally the two should be equal and infinitely close to 1.0). If a reference frame meeting these conditions is found, the long-term reference frame is reset to it and kept unchanged in the next predictions, and the short-term reference frames are added to make the final prediction; if no reference frame meeting the conditions is found, the target has changed greatly, so frames as similar as possible are used, finally determined to be frames t-1 and t-3;
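A compact sketch of the selection logic of steps S41 and S44, written in Python for this description, is given below; the set representation of the memories and the helper names are assumptions, and the IoU and ratio definitions follow the reconstruction given above.

def short_term_indices(t, label_change_rate, threshold=0.10):
    """S41: pick the short-term reference frames from the label change rate."""
    return [t - 1, t - 2, t - 3] if label_change_rate >= threshold else [t - 1, t - 3, t - 5]

def iou_and_ratio(long_mem, short_mem):
    """long_mem / short_mem: sets of pixel coordinates marked as the target object."""
    inter = len(long_mem & short_mem)
    union = len(long_mem | short_mem)
    iou = inter / union if union else 0.0
    ratio = len(long_mem) / len(short_mem) if short_mem else float("inf")
    return iou, ratio

def filter_long_term(candidates, ratios):
    """Keep candidate long-term reference frames whose ratio lies in (0.9, 1.05)."""
    return [f for f, r in zip(candidates, ratios) if 0.9 < r < 1.05]

def build_reference_group(iou, long_group, short_group, frame_t3, reselect):
    """S44: assemble the reference frame group from the IoU between the memories."""
    if iou > 0.9:
        return long_group + [frame_t3]
    if iou >= 0.6:
        return long_group + short_group
    # IoU < 0.6: reselect the long-term reference frames (traverse from frame 0) and retry
    return reselect() + short_group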
After the long-term reference frame group is reselected, foreground-background segmentation is performed on each long-term reference frame according to the long-term memory and the short-term memory based on the grabcut method.
When the dynamic adjustment mechanism reselects the long-term reference frame, the selected result may have errors accumulated in the prediction process of the previous stage, which may cause the degradation of prediction quality, and even misjudge the background obviously not belonging to the preset target object as the preset target object. Therefore, after the dynamic adjustment mechanism reselects the long-term reference frame, foreground and background segmentation is required to be performed, labels which obviously do not belong to the preset target object are removed, and meanwhile, the predicted result can be smoothed, so that the edge of the predicted result is more adaptive to the preset target object.
The method adopts the grabcut method to segment the foreground and background. The grabcut method requires initial labels to be set, namely background, foreground, possible background and possible foreground, represented by 0, 1, 2 and 3 respectively. While the method runs, a color model of the foreground and background is generated from the initially labeled foreground and background, and the color model is then used to decide whether each pixel point labeled as possible background or possible foreground is foreground or background.
Referring to fig. 4, the label initialization is as follows: the part where the long-term memory and the short-term memory overlap as foreground is labeled foreground; the part where they overlap as background is labeled background; the part that is foreground in the long-term memory but background in the short-term memory is labeled possible background; and the part that is foreground in the short-term memory but background in the long-term memory is labeled possible background.
Considering the deformation and displacement of the preset target object, pixels near the foreground and the possible foreground, although labeled as background, may still belong to the preset target object. The pixels labeled foreground and possible foreground are therefore grown: with each such pixel as a center point, the pixels within a square region of side length 15 that are labeled possible background are relabeled possible foreground. After the label initialization for the grabcut method is completed, the grabcut method is run, and its result directly replaces the original prediction result.
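The following sketch, using OpenCV's grabCut in Python and written for this description, applies the label initialization and growth scheme described above to a reselected long-term reference frame; the boolean-mask inputs and the fixed iteration count are illustrative assumptions.

import cv2
import numpy as np

def segment_reference_frame(image, long_mem, short_mem, grow=15, iters=5):
    """Foreground/background refinement of a long-term reference frame with grabCut.

    image: H x W x 3 uint8 frame; long_mem, short_mem: H x W boolean masks of pixels
    marked as the target object by the long-term and short-term memory.
    """
    mask = np.full(image.shape[:2], cv2.GC_PR_BGD, dtype=np.uint8)  # default: possible background
    mask[long_mem & short_mem] = cv2.GC_FGD                         # agreement on foreground
    mask[~long_mem & ~short_mem] = cv2.GC_BGD                       # agreement on background
    # disagreements between the two memories stay labeled possible background (GC_PR_BGD)

    # grow the (possible) foreground: possible-background pixels inside a 15 x 15
    # square around any foreground pixel become possible foreground
    fg = ((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)).astype(np.uint8)
    kernel = np.ones((grow, grow), np.uint8)
    grown = cv2.dilate(fg, kernel) > 0
    mask[grown & (mask == cv2.GC_PR_BGD)] = cv2.GC_PR_FGD

    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image, mask, None, bgd_model, fgd_model, iters, cv2.GC_INIT_WITH_MASK)
    # the refined result directly replaces the original prediction
    return (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)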
And S5, based on the feature extraction network, the dynamic adjustment module and the target prediction module, taking the video sample frame as input, and taking the position parameter and the area parameter of the preset target object in the feature map of the next frame of the feature map output in real time in the feature extraction network as output to construct a target tracking model. In one embodiment, the area of the preset target object is represented by the number of tags belonging to the preset target object in the query frame, and the position of the preset target object is represented by the centroid of the preset target object in the query frame.
In one embodiment, the preset value of the rate of change of the number of labels is 10%.
The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.

Claims (7)

1. An unsupervised dense target tracking method based on video coloring, characterized in that a target tracking model is obtained according to the following steps S1-S5, and the target tracking model is then applied to complete the tracking of a preset target object;
S1: obtain video sample frames that are arranged in time order and each contain a preset target object;
S2: based on a convolutional neural network and an SRM module, construct a feature extraction network that takes a video sample frame as input and outputs the feature map corresponding to that video sample frame, where the first period at which the feature extraction network outputs feature maps is T1;
S3: based on the feature extraction network's real-time output of feature maps, construct a dynamic adjustment module that takes the feature maps output in real time by the feature extraction network as real-time input and outputs, at a second period, a feature map group formed from a preset number of the feature maps obtained so far, where the second period T2 at which the dynamic adjustment module outputs feature map groups satisfies T2 > T1;
S4: construct a reference frame group based on the feature map output in real time by the feature extraction network and the feature map group output by the dynamic adjustment module, and construct a target prediction module that takes the reference frame group as input and outputs the position parameter and the area parameter of the preset target object in the feature map of the frame following the feature map output in real time by the feature extraction network;
S5: based on the feature extraction network, the dynamic adjustment module and the target prediction module, construct a target tracking model that takes the video sample frames as input and outputs the position parameter and the area parameter of the preset target object in the feature map of the frame following the feature map output in real time by the feature extraction network.
2. The unsupervised dense target tracking method based on video coloring as claimed in claim 1, wherein the feature extraction network, the dynamic adjustment module and the target prediction module are connected in series in sequence; the feature extraction network outputs feature maps to the dynamic adjustment module in real time at the first period T1; the dynamic adjustment module receives the feature maps in real time at the first period T1 and outputs, at the second period T2, a feature map group formed from a preset number of the feature maps obtained so far to the target prediction module; a reference frame group is constructed from the feature map group output by the dynamic adjustment module and used as the input of the target prediction module, the reference frame group including the latest frame feature map output by the feature extraction network and received in real time by the dynamic adjustment module.
3. The unsupervised dense target tracking method based on video coloring as claimed in claim 1, wherein the feature extraction network, the dynamic adjustment module and the target prediction module are pairwise connected to one another; the feature extraction network outputs feature maps in real time at the first period T1 to both the dynamic adjustment module and the target prediction module; the dynamic adjustment module receives the feature maps in real time at T1 and outputs, at the second period T2, a feature map group formed from a preset number of the feature maps obtained so far to the target prediction module; the latest frame feature map output in real time by the feature extraction network to the target prediction module at the first period T1, together with the feature map group output by the dynamic adjustment module to the target prediction module at the second period T2, constitutes the reference frame group that serves as the input of the target prediction module.
4. The unsupervised dense target tracking method based on video coloring as claimed in claim 1, wherein in step S2 the SRM module is combined with each residual block of the convolutional neural network to adjust the weight of each channel of the feature map output by the feature extraction network, comprising the following specific steps:
S21: the convolutional neural network takes a video sample frame as input and outputs a feature map of dimension C × H × W, where C is the number of channels of the feature map, H is the height of the feature map, and W is the width of the feature map; the standard deviation, mean, maximum and entropy of each channel of the feature map are computed, giving a matrix of dimension C × 4;
S22: apply a 1 × 1 convolution to the C × 4 matrix to obtain a C × 1 weight vector, adjust the weight of each channel of the feature map according to this weight vector, and take the adjusted feature map as the output of the feature extraction network.
5. The unsupervised dense target tracking method based on video coloring as claimed in claim 1, wherein the reference frame group comprises a long-term reference frame group and a short-term reference frame group, and step S4 specifically comprises the following steps:
S41: take the feature map of the frame following the feature map output in real time by the feature extraction network as the query frame, its frame number being t. In each feature map, each pixel point belonging to the preset target object corresponds to one label. Compare the number of labels in the feature map output in real time by the feature extraction network with the number of labels in the previous frame's feature map: if the rate of change of the number of labels is greater than or equal to a preset value, select the feature maps of frames t-1, t-2 and t-3 to construct the short-term reference frame group; if the rate of change of the number of labels is less than the preset value, select the feature maps of frames t-1, t-3 and t-5 to construct the short-term reference frame group;
when the short-term reference frame group consists of the feature maps of frames t-1, t-2 and t-3, a preset number of feature maps earlier than frame t-3, taken backward along the time axis, serve as candidate long-term reference frames;
when the short-term reference frame group consists of the feature maps of frames t-1, t-3 and t-5, a preset number of feature maps earlier than frame t-5, taken backward along the time axis, serve as candidate long-term reference frames;
S42: take the pixel points at preset positions in the query frame as query pixel points, search among the candidate pixel points of each candidate long-term reference frame for pixel points belonging to the preset target object, and mark the corresponding query pixel points as belonging to the preset target object with the same label; the result obtained in this process is the long-term memory;
The candidate pixel points are constructed based on an expansion rate dil:

[expansion rate formula reproduced only as an image in the original; dil is computed from the offset between the centroids C_{t-1} and C_r relative to the feature map height H and width W]

where C_{t-1} is the centroid of the feature map of the frame preceding the query frame, C_r is the centroid of the reference frame, H is the height of the feature map, and W is the width of the feature map; the pixel points within the range corresponding to the expansion rate are the candidate pixel points;
S43: with the coordinates in each short-term reference frame corresponding to each query pixel point in the query frame as centers, divide square regions with a side length of 15 pixel points; search within each square region for pixel points belonging to the preset target object, and mark the corresponding query pixel points as belonging to the preset target object with the same label; the result obtained in this process is the short-term memory;
S44: select the reference frame group to be input to the target prediction module based on the coincidence-degree parameters IoU and ratio of the long-term memory and the short-term memory, where IoU and ratio are:

IoU = |l_m ∩ s_m| / |l_m ∪ s_m|

ratio = |l_m| / |s_m|

where l_m is the set of pixel points marked in the long-term memory as belonging to the preset target object, and s_m is the set of pixel points marked in the short-term memory as belonging to the preset target object;
Taking the candidate long-term reference frames whose ratio value is greater than 0.9 and less than 1.05 as the long-term reference frame group, the reference frame group is constructed as follows:
(1) when IoU > 0.9, construct the reference frame group from the long-term reference frame group and the feature map of frame t-3;
(2) when 0.6 ≤ IoU ≤ 0.9, construct the reference frame group from the long-term reference frame group and the short-term reference frame group;
(3) when IoU < 0.6, reselect the long-term reference frame group according to a preset rule, and construct the reference frame group from the reselected long-term reference frame group and the short-term reference frame group.
6. The unsupervised dense target tracking method based on video coloring according to claim 2, wherein the preset value of the rate of change of the number of labels is 10%.
7. The unsupervised dense target tracking method based on video coloring as claimed in claim 2, wherein after the long-term reference frame group is reselected in step S44, foreground-background segmentation is performed on each long-term reference frame according to the long-term memory and the short-term memory based on the grabcut method.
CN202111609449.4A 2021-12-24 2021-12-24 Unsupervised target dense tracking method based on video coloring Pending CN114399531A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111609449.4A CN114399531A (en) 2021-12-24 2021-12-24 Unsupervised target dense tracking method based on video coloring

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111609449.4A CN114399531A (en) 2021-12-24 2021-12-24 Unsupervised target dense tracking method based on video coloring

Publications (1)

Publication Number Publication Date
CN114399531A true CN114399531A (en) 2022-04-26

Family

ID=81226420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111609449.4A Pending CN114399531A (en) 2021-12-24 2021-12-24 Unsupervised target dense tracking method based on video coloring

Country Status (1)

Country Link
CN (1) CN114399531A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6335985B1 (en) * 1998-01-07 2002-01-01 Kabushiki Kaisha Toshiba Object extraction apparatus
US6774917B1 (en) * 1999-03-11 2004-08-10 Fuji Xerox Co., Ltd. Methods and apparatuses for interactive similarity searching, retrieval, and browsing of video
CN109086709A (en) * 2018-07-27 2018-12-25 腾讯科技(深圳)有限公司 Feature Selection Model training method, device and storage medium
CN111090756A (en) * 2020-03-24 2020-05-01 腾讯科技(深圳)有限公司 Artificial intelligence-based multi-target recommendation model training method and device
CN112950675A (en) * 2021-03-18 2021-06-11 深圳市商汤科技有限公司 Target tracking method and device, electronic equipment and storage medium
CN113344976A (en) * 2021-06-29 2021-09-03 常州工学院 Visual tracking method based on target object characterization point estimation



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination