CN114399531A - Unsupervised target dense tracking method based on video coloring - Google Patents
Unsupervised target dense tracking method based on video coloring
- Publication number: CN114399531A
- Application number: CN202111609449.4A
- Authority: CN (China)
- Prior art keywords: feature, reference frame, feature map, taking, frame group
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/045—Combinations of networks
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06T2207/10016—Video; Image sequence
- G06T2207/20081—Training; Learning
Abstract
The invention discloses an unsupervised target dense tracking method based on video coloring. A target tracking model is constructed from video sample frames that are arranged in time sequence and each contain a preset target object, and the model is then applied to complete the tracking of the preset target object. The target tracking model comprises a feature extraction network, a dynamic adjustment module and a target prediction module: the feature extraction network obtains the feature map corresponding to each video sample frame; the dynamic adjustment module outputs a feature map group formed from a preset number of the obtained feature maps; and the target prediction module predicts the position parameter and area parameter of the preset target object in the feature map of the frame following the feature map output in real time by the feature extraction network.
Description
Technical Field
The invention relates to a target dense tracking method, and in particular to an unsupervised target dense tracking method based on video coloring.
Background
Target tracking is a widely studied problem in computer vision. Existing target tracking usually requires a large number of image labels, and the types of targets that can be tracked are limited, so unsupervised target tracking has attracted extensive attention and intensive research in academia. Existing solutions to the unsupervised target tracking problem fall mainly into four categories: methods based on correspondence flow, methods based on temporal cycle consistency, methods based on multi-grid prediction filters, and methods based on video coloring. The invention is based on the video coloring approach, whose general flow is: given a certain frame in a video, traverse each pixel point p_i of the frame, find the point set {q_j} most similar to p_i within a specific region of a reference frame or set of reference frames, take the similarity between each point q_j and p_i as a weight w_ij, and obtain the predicted label value of each pixel point of the frame as the weighted sum of the label values of the points in the point set. The prediction accuracy of such methods generally does not exceed 40% (J&F-mean index); the improvements introduced by MAST raise the accuracy to about 65%, a qualitative improvement in tracking performance and the highest accuracy among algorithms of this class, but a gap remains with respect to supervised target tracking methods; for example, the accuracy of PReMVOS can reach 75% to 80%. Therefore, there is still a need and room for improvement in unsupervised target tracking methods.
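As a rough illustration of this propagation rule, the sketch below predicts each query pixel's label as a similarity-weighted sum of reference label values. The dot-product similarity, softmax weighting and `temperature` parameter are illustrative assumptions, not the patent's exact formulation.

```python
import math

def propagate_labels(query_feats, ref_feats, ref_labels, temperature=1.0):
    """Video-coloring label propagation: for each query pixel p_i, weight the
    reference label values y_j by a softmax over feature similarities w_ij."""
    predictions = []
    for f_q in query_feats:
        # Similarity of the query pixel to every reference pixel (dot product).
        sims = [sum(a * b for a, b in zip(f_q, f_r)) / temperature
                for f_r in ref_feats]
        m = max(sims)
        exps = [math.exp(s - m) for s in sims]
        z = sum(exps)
        weights = [e / z for e in exps]  # w_ij, sums to 1 over j
        # Predicted label value: weighted sum of reference label values.
        predictions.append(sum(w * y for w, y in zip(weights, ref_labels)))
    return predictions
```

In practice the "specific region" restriction means each query pixel only attends to a local window of reference pixels rather than the whole frame.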
Disclosure of Invention
The purpose of the invention is as follows: to provide an unsupervised target dense tracking method based on video coloring that achieves high prediction accuracy and high tracking quality for a preset target object.
To realize this function, the invention designs an unsupervised target dense tracking method based on video coloring, comprising the following steps S1 to S5 for obtaining a target tracking model; the target tracking model is then applied to complete the tracking of the preset target object.
S1: obtain video sample frames that are arranged in time sequence and each contain a preset target object;
S2: based on a convolutional neural network and the SRM module, construct a feature extraction network that takes the video sample frames as input and outputs the feature map corresponding to each video sample frame, wherein the first period with which the feature extraction network outputs feature maps is T1;
S3: based on the feature maps output in real time by the feature extraction network, construct a dynamic adjustment module that takes the feature maps output in real time as input and outputs, with a second period, a feature map group formed from a preset number of the obtained feature maps, wherein the second period with which the dynamic adjustment module outputs the feature map group is T2, and T2 > T1;
S4: construct a reference frame group based on the feature map output in real time by the feature extraction network and the feature map group output by the dynamic adjustment module, and construct a target prediction module that takes the reference frame group as input and outputs the position parameter and area parameter of the preset target object in the feature map of the frame following the feature map output in real time by the feature extraction network;
S5: based on the feature extraction network, the dynamic adjustment module and the target prediction module, construct a target tracking model that takes the video sample frames as input and outputs the position parameter and area parameter of the preset target object in the feature map of the frame following the feature map output in real time by the feature extraction network.
As a preferred technical scheme of the invention: the feature extraction network, the dynamic adjustment module and the target prediction module are connected in series in sequence. The feature extraction network outputs feature maps to the dynamic adjustment module in real time with the first period T1; the dynamic adjustment module receives the feature maps in real time with the first period T1 and outputs, with the second period T2, a feature map group formed from a preset number of the obtained feature maps to the target prediction module. The reference frame group is constructed from the feature map group output by the dynamic adjustment module and serves as the input of the target prediction module, wherein the reference frame group includes the latest frame feature map output by the feature extraction network and received in real time by the dynamic adjustment module.
As a preferred technical scheme of the invention: the feature extraction network, the dynamic adjustment module and the target prediction module are pairwise interconnected. The feature extraction network outputs feature maps in real time with the first period T1 to both the dynamic adjustment module and the target prediction module; the dynamic adjustment module receives the feature maps in real time with period T1 and outputs, with the second period T2, a feature map group formed from a preset number of the obtained feature maps to the target prediction module. The latest frame feature map output in real time by the feature extraction network with period T1 to the target prediction module, together with the feature map group output by the dynamic adjustment module with period T2, constructs the reference frame group that serves as the input of the target prediction module.
As a preferred technical scheme of the invention: in step S2, the SRM module is combined with each residual block of the convolutional neural network to adjust the weight of each channel of the feature map output by the feature extraction network; the specific steps are as follows:
S21: the convolutional neural network takes a video sample frame as input and outputs a feature map of dimension C × H × W, where C is the number of channels, H is the height and W is the width of the feature map; the standard deviation, mean, maximum and entropy of each channel of the feature map are calculated to obtain a matrix of dimension C × 4;
S22: perform a 1 × 1 convolution on the C × 4 matrix to obtain a C × 1 weight vector, adjust the weight of each channel of the feature map according to this weight vector, and take the adjusted feature map as the output of the feature extraction network.
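Steps S21 and S22 can be sketched in plain Python as follows. The sigmoid gate, the histogram-based entropy and the learned `conv_weights` vector are assumptions filled in for illustration; the patent itself specifies only the four statistics and the 1 × 1 convolution.

```python
import math

def channel_stats(channel):
    """One row of the C x 4 style matrix: std, mean, max, entropy (step S21)."""
    n = len(channel)
    mean = sum(channel) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in channel) / n)
    mx = max(channel)
    # Entropy over a 20-bin histogram on [-1, 1] with step 0.1.
    counts = [0] * 20
    for v in channel:
        idx = min(19, max(0, int((v + 1.0) / 0.1)))
        counts[idx] += 1
    probs = [c / n for c in counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return [std, mean, mx, entropy]

def srm_reweight(feature_map, conv_weights):
    """Step S22: a 1x1 convolution over the style matrix yields one scalar per
    channel; an (assumed) sigmoid gate then rescales each channel."""
    out = []
    for channel in feature_map:  # feature_map: C channels, each a flat H*W list
        s = channel_stats(channel)
        logit = sum(w * v for w, v in zip(conv_weights, s))  # 1x1 conv = dot product
        gate = 1.0 / (1.0 + math.exp(-logit))  # per-channel weight in (0, 1)
        out.append([gate * v for v in channel])
    return out
```

In the real network `conv_weights` would be learned by backpropagation together with the convolutional backbone.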
As a preferred technical scheme of the invention: the reference frame group includes a long-term reference frame group and a short-term reference frame group, and step S4 specifically includes the following steps:
S41: take the feature map of the frame following the feature map output in real time by the feature extraction network as the query frame, with frame number t. Each pixel point belonging to the preset target object in each feature map corresponds to one label. Compare the number of labels in the feature map output in real time by the feature extraction network with the number of labels in the previous frame's feature map: if the rate of change of the label count is greater than or equal to a preset value, select the feature maps of frames t-1, t-2 and t-3 to construct the short-term reference frame group; if the rate of change of the label count is less than the preset value, select the feature maps of frames t-1, t-3 and t-5 to construct the short-term reference frame group;
when the short-term reference frame group consists of the feature maps of frames t-1, t-2 and t-3, take a preset number of feature maps preceding frame t-3, going back along the time axis, as candidate long-term reference frames;
when the short-term reference frame group consists of the feature maps of frames t-1, t-3 and t-5, take a preset number of feature maps preceding frame t-5, going back along the time axis, as candidate long-term reference frames;
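A minimal sketch of this short-term frame selection, assuming the change rate is computed between the label counts of frames t-2 and t-1 (the patent compares the real-time output frame with its previous frame) and using the 10% preset value given later:

```python
def select_short_term_frames(t, label_counts, threshold=0.10):
    """Pick short-term reference frames for query frame t from the label-count
    change rate between frames t-2 and t-1 (threshold preset to 10%)."""
    prev, curr = label_counts[t - 2], label_counts[t - 1]
    change_rate = abs(curr - prev) / prev
    if change_rate >= threshold:
        return [t - 1, t - 2, t - 3]   # rapid change: use consecutive frames
    return [t - 1, t - 3, t - 5]       # slow change: use spaced-out frames
```

The candidate long-term reference frames are then drawn from frames older than the oldest short-term frame returned here.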
S42: take the pixel points at preset positions in the query frame as query pixel points, search the candidate pixel points in each candidate long-term reference frame for pixel points belonging to the preset target object, mark the query pixel points sharing the same label as belonging to the preset target object according to the corresponding label, and take the result obtained by this process as the long-term memory;
the candidate pixel points are constructed based on an expansion rate dil, where C_{t-1} is the centroid of the feature map of the frame preceding the query frame, C_r is the centroid of the reference frame, H is the height of the feature map and W is the width of the feature map; the pixel points in the range corresponding to the expansion rate are the candidate pixel points;
S43: taking the coordinates in each short-term reference frame corresponding to each query pixel point in the query frame as centers, divide square areas with a side length of 15 pixel points, search each square area for pixel points belonging to the preset target object, mark the query pixel points sharing the same label as belonging to the preset target object according to the corresponding label, and take the result obtained by this process as the short-term memory;
S44: select the reference frame group input to the target prediction module based on the coincidence parameters IoU (the intersection-over-union of the two memories) and ratio of the long-term memory and the short-term memory, where l_m denotes the pixel points marked in the long-term memory as belonging to the preset target object, and s_m denotes the pixel points marked in the short-term memory as belonging to the preset target object;
taking candidate long-term reference frames with ratio values larger than 0.9 and smaller than 1.05 as a long-term reference frame group, and constructing the reference frame group according to the following method:
(1) when IoU is more than 0.9, constructing a reference frame group by using the long-term reference frame group and the t-3 th frame feature map;
(2) when 0.6 ≤ IoU ≤ 0.9, construct the reference frame group from the long-term reference frame group and the short-term reference frame group;
(3) when IoU <0.6, the long-term reference frame group is reselected according to a preset rule, and the reference frame group is constructed by the reselected long-term reference frame group and the short-term reference frame group.
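A plausible reading of step S44 in code; the exact IoU and ratio formulas appear only as images in the source, so `ratio` is assumed here to be the size ratio of the two memories (consistent with the 0.9 to 1.05 acceptance band):

```python
def memory_agreement(l_m, s_m):
    """Coincidence parameters between long-term memory l_m and short-term
    memory s_m, each a set of (row, col) pixel coordinates marked as target."""
    iou = len(l_m & s_m) / len(l_m | s_m)
    ratio = len(l_m) / len(s_m)  # assumed size ratio; candidates pass if 0.9 < ratio < 1.05
    return iou, ratio

def build_reference_group(iou, long_group, short_group, t):
    """Selection rules (1)-(3) from step S44."""
    if iou > 0.9:
        return long_group + [t - 3]          # (1) long-term group + frame t-3
    if iou >= 0.6:
        return long_group + short_group      # (2) long-term + short-term groups
    return None                              # (3) reselect the long-term group first
```

Returning `None` in case (3) signals the caller to rerun long-term frame selection before building the group.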
As a preferred technical scheme of the invention: the preset value of the rate of change of the number of labels is 10%.
As a preferred technical scheme of the invention: after the long-term reference frame group is reselected in step S44, foreground and background segmentation is performed on each long-term reference frame according to the long-term memory and the short-term memory, based on the grabcut method.
Beneficial effects: compared with the prior art, the invention has the following advantages:
the invention designs an unsupervised target dense tracking method based on video coloring, which integrates an SRM module into a residual error module of a feature extraction network, retrains the feature extraction network, and can enhance the capability of the network for extracting features; then, a dynamic reference frame adjusting mechanism and a foreground and background segmenting mechanism are combined, a proper reference frame is selected according to the result of the relevant parameters, the label is spread in the query frame, the tracking scene with the violent change of the target can be adapted, and the condition that the label is scattered on the background is reduced; overall, the model may improve the accuracy of target tracking in various scenarios.
Drawings
Fig. 1 is a schematic diagram of an SRM module provided according to an embodiment of the invention;
fig. 2 is a schematic diagram of a residual module provided in accordance with an embodiment of the invention in combination with an SRM module;
FIG. 3 is a flow chart of a mechanism for dynamically adjusting reference frames according to an embodiment of the invention;
fig. 4 is a schematic diagram of a foreground and background segmentation mechanism based on the grabcut method according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
The invention provides an unsupervised target dense tracking method based on video coloring, comprising the following steps S1 to S5 for obtaining a target tracking model; the target tracking model is then applied to complete the tracking of the preset target object.
S1: obtain video sample frames that are arranged in time sequence and each contain a preset target object;
S2: based on the convolutional neural network and the SRM module, construct a feature extraction network that takes the video sample frames as input and outputs the feature map corresponding to each video sample frame, wherein the first period with which the feature extraction network outputs feature maps is T1;
Referring to fig. 1 and 2, the SRM module is combined with each residual block of the convolutional neural network to adjust the weights of the channels of the feature map output by the feature extraction network; the convolutional neural network may use ResNet-18 or ResNet-50. The specific steps are as follows:
S21: the original convolution part of the convolutional neural network is unchanged. The convolutional neural network takes a video sample frame as input and outputs a feature map of dimension C × H × W, where C is the number of channels, H is the height and W is the width of the feature map. The standard deviation, mean, maximum and entropy of each channel of the feature map are calculated to obtain a matrix of dimension C × 4, called the style matrix. The entropy of a channel X is calculated per channel as -Σ_{x∈χ} p(x) log p(x), where χ is a set of intervals composed of 20 small intervals covering the range -1 to 1 with step length 0.1, and p(x) is the probability that a feature value of X falls within the interval x, with 0 < p(x) < 1. The functions for the standard deviation, mean and maximum are differentiable, while the entropy calculation is not, so the derivative of the entropy needs to be custom-defined.
since p (X) is a discrete function, it is heuristically set that when the feature vector X is uniformly distributed,if p (x) is 0.05 and p' (x) is 0.025, the derivative of the entropy solving function is given by the following equation:
S22: perform a 1 × 1 convolution on the C × 4 matrix to obtain a C × 1 weight vector, adjust the weight of each channel of the feature map according to this weight vector, and take the adjusted feature map as the output of the feature extraction network.
S3: based on the feature maps output in real time by the feature extraction network, construct a dynamic adjustment module that takes the feature maps output in real time as input and outputs, with a second period, a feature map group formed from a preset number of the obtained feature maps, wherein the second period with which the dynamic adjustment module outputs the feature map group is T2, and T2 > T1;
In one embodiment, the feature extraction network, the dynamic adjustment module and the target prediction module are connected in series in sequence. The feature extraction network outputs feature maps to the dynamic adjustment module in real time with the first period T1; the dynamic adjustment module receives the feature maps in real time with the first period T1 and outputs, with the second period T2, a feature map group formed from a preset number of the obtained feature maps to the target prediction module. The reference frame group is constructed from the feature map group output by the dynamic adjustment module and serves as the input of the target prediction module, wherein the reference frame group includes the latest frame feature map output by the feature extraction network and received in real time by the dynamic adjustment module.
In one embodiment, the feature extraction network, the dynamic adjustment module and the target prediction module are pairwise interconnected. The feature extraction network outputs feature maps in real time with the first period T1 to both the dynamic adjustment module and the target prediction module; the dynamic adjustment module receives the feature maps in real time with period T1 and outputs, with the second period T2, a feature map group formed from a preset number of the obtained feature maps to the target prediction module. The latest frame feature map output in real time by the feature extraction network with period T1 to the target prediction module, together with the feature map group output by the dynamic adjustment module with period T2, constructs the reference frame group that serves as the input of the target prediction module.
S4: construct a reference frame group based on the feature map output in real time by the feature extraction network and the feature map group output by the dynamic adjustment module, and construct a target prediction module that takes the reference frame group as input and outputs the position parameter and area parameter of the preset target object in the feature map of the frame following the feature map output in real time by the feature extraction network;
the reference frame group includes a long-term reference frame group and a short-term reference frame group, and step S4 specifically includes the following steps:
S41: take the feature map of the frame following the feature map output in real time by the feature extraction network as the query frame, with frame number t. Each pixel point belonging to the preset target object in each feature map corresponds to one label. Compare the number of labels in the feature map output in real time by the feature extraction network with the number of labels in the previous frame's feature map: if the rate of change of the label count is greater than or equal to a preset value, select the feature maps of frames t-1, t-2 and t-3 to construct the short-term reference frame group; if the rate of change of the label count is less than the preset value, select the feature maps of frames t-1, t-3 and t-5 to construct the short-term reference frame group;
when the short-term reference frame group consists of the feature maps of frames t-1, t-2 and t-3, take a preset number of feature maps preceding frame t-3, going back along the time axis, as candidate long-term reference frames;
when the short-term reference frame group consists of the feature maps of frames t-1, t-3 and t-5, take a preset number of feature maps preceding frame t-5, going back along the time axis, as candidate long-term reference frames;
S42: take the pixel points at preset positions in the query frame as query pixel points, search the candidate pixel points in each candidate long-term reference frame for pixel points belonging to the preset target object, mark the query pixel points sharing the same label as belonging to the preset target object according to the corresponding label, and take the result obtained by this process as the long-term memory;
The position of the preset target object is represented by its centroid, which mainly influences the expansion rate used for the long-term memory. The expansion rate represents the spacing between the horizontal and vertical coordinates of adjacent candidate pixel points: the larger the spacing, the larger the receptive field, and when the value is 4 the receptive field basically covers the whole reference frame. The farther the position of the preset target object in the query frame deviates from its position in the reference frame, the larger the expansion rate should be. The candidate pixel points are constructed based on the expansion rate dil, where C_{t-1} is the centroid of the preset target object in the feature map of the frame preceding the query frame, C_r is the centroid of the preset target object in the reference frame, H is the height of the feature map and W is the width of the feature map; the pixel points in the range corresponding to the expansion rate are the candidate pixel points;
S43: taking the coordinates in each short-term reference frame corresponding to each query pixel point in the query frame as centers, divide square areas with a side length of 15 pixel points, search each square area for pixel points belonging to the preset target object, mark the query pixel points sharing the same label as belonging to the preset target object according to the corresponding label, and take the result obtained by this process as the short-term memory;
In experiments, the centroid C_{t-1} predicted from the long-term memory lies closer to C_r than to the true centroid, because labels are scattered onto the background as the target moves from C_r to C_{t-1}. If dil were 0, the process would not differ from the short-term memory prediction, so dil is at least 1; accounting further for the centroid shift, one embodiment sets 2 ≤ dil ≤ 4, with the specific value determined by the ratio of the centroid shift to the maximum picture offset.
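The exact formula for dil is reproduced only as an image in the source; the sketch below is a hypothetical reading that merely respects the stated constraints (2 ≤ dil ≤ 4, growing with the centroid shift relative to the maximum picture offset):

```python
def expansion_rate(c_prev, c_ref, height, width):
    """Hypothetical dilation-rate rule: scale the centroid shift between
    C_{t-1} and C_r by the maximum picture offset, then clamp to [2, 4]."""
    shift = max(abs(c_prev[0] - c_ref[0]), abs(c_prev[1] - c_ref[1]))
    max_offset = max(height, width)
    dil = round(2 + 2 * shift / max_offset)  # maps shift ratio 0..1 onto 2..4
    return max(2, min(4, dil))

def candidate_pixels(c_ref, dil, height, width):
    """Candidate pixel grid around C_r with coordinate spacing dil."""
    return [(r, c)
            for r in range(c_ref[0] % dil, height, dil)
            for c in range(c_ref[1] % dil, width, dil)]
```

The spacing `dil` trades density for coverage: a sparser grid widens the receptive field when the target has moved far from its reference position.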
S44: the similarity between a long-term reference frame and the query frame is measured by the coincidence of the long-term memory and the short-term memory. Ideally the long-term and short-term memory predictions should be completely consistent, but owing to displacement and deformation of the preset target object, among other factors, the two often differ considerably, so selecting a suitable long-term reference frame helps improve tracking quality. The reference frame group input to the target prediction module is selected based on the coincidence parameters IoU and ratio of the long-term memory and the short-term memory:
In the formulas, l_m denotes the pixel points marked in the long-term memory as belonging to the preset target object, and s_m denotes the pixel points marked in the short-term memory as belonging to the preset target object;
with the candidate long-term reference frames with the ratio value greater than 0.9 and less than 1.05 as the long-term reference frame group, with reference to fig. 3, the reference frame group is constructed as follows:
(1) when IoU is more than 0.9, constructing a reference frame group by using the long-term reference frame group and the t-3 th frame feature map;
(2) when 0.6 ≤ IoU ≤ 0.9, construct the reference frame group from the long-term reference frame group and the short-term reference frame group;
(3) when IoU < 0.6, the long-term reference frame group is reselected according to a preset rule, and the reference frame group is constructed from the reselected long-term reference frame group and the short-term reference frame group. In one embodiment, if IoU < 0.6, the frames are traversed from frame 0 to find reference frames with a higher IoU value while ensuring that IoU stays close to ratio (ideally the two are equal and infinitely close to 1.0). If a reference frame meeting these conditions is found, the long-term reference frames are reset and kept unchanged in the following predictions, and the short-term reference frames are added to make the final prediction; if no reference frame meeting the conditions is found, the target has changed greatly, so frames as similar as possible are used, finally determined as frames t-1 and t-3;
and after the long-term reference frame group is reselected, foreground and background segmentation is performed on each long-term reference frame according to the long-term memory and the short-term memory, based on the grabcut method.
When the dynamic adjustment mechanism reselects the long-term reference frame, the selected result may carry errors accumulated during the earlier prediction stages, which degrades prediction quality and may even misjudge background that obviously does not belong to the preset target object as the preset target object. Therefore, after the dynamic adjustment mechanism reselects the long-term reference frame, foreground and background segmentation must be performed to remove labels that obviously do not belong to the preset target object; at the same time, the predicted result is smoothed so that its edges fit the preset target object better.
The method adopts a grabcut method to segment the front background, and the grabcut method needs to set initial labels which are respectively a background, a foreground, a possible background and a possible foreground and are correspondingly represented by 0, 1, 2 and 3. In the operation process of the method, a color mode of a front background is generated according to an initially labeled foreground and a background, and then whether a possible background pixel point labeled as a possible foreground is a foreground or a background is judged according to the color mode.
Referring to fig. 4, the labels are initialized as follows: the part that is foreground in both the long-term and the short-term memory is marked as foreground; the part that is background in both is marked as background; the part that is foreground in the long-term memory but background in the short-term memory is marked as possible background; the part that is foreground in the short-term memory but background in the long-term memory is marked as possible foreground.
Considering deformation and displacement of the preset target object, pixels near the foreground and possible foreground may still belong to the preset target object even though they are marked as background. The pixels marked as foreground or possible foreground are therefore grown: taking each such pixel as the center point of a square area with a side length of 15, the pixels within that area that are marked as possible background are relabelled as possible foreground. After this grabcut label initialization is completed, the grabcut method is run and its result directly replaces the original prediction result.
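The label initialization and growing steps above can be sketched as follows. The labels 0-3 match OpenCV's grabCut conventions (`GC_BGD`, `GC_FGD`, `GC_PR_BGD`, `GC_PR_FGD`); the loop-based growing is a naive implementation chosen for clarity, and the interpretation that only possible-background pixels are relabelled is an assumption.

```python
import numpy as np

def init_grabcut_mask(long_mem, short_mem, side=15):
    """Build a grabCut label mask from boolean long/short-term memory masks.

    0 = background, 1 = foreground, 2 = possible background,
    3 = possible foreground, per the label convention in the text.
    """
    mask = np.empty(long_mem.shape, dtype=np.uint8)
    mask[long_mem & short_mem] = 1     # foreground in both -> foreground
    mask[~long_mem & ~short_mem] = 0   # background in both -> background
    mask[long_mem & ~short_mem] = 2    # LT foreground only -> possible background
    mask[~long_mem & short_mem] = 3    # ST foreground only -> possible foreground

    # Grow: in a 15x15 square centred on each (possible) foreground pixel,
    # relabel possible-background pixels as possible foreground.
    half = side // 2
    h, w = mask.shape
    centers = np.argwhere((mask == 1) | (mask == 3))
    grown = mask.copy()
    for y, x in centers:
        win = grown[max(0, y - half):min(h, y + half + 1),
                    max(0, x - half):min(w, x + half + 1)]
        win[win == 2] = 3
    return grown
# The resulting mask would then be passed to cv2.grabCut(img, mask, None,
# bgdModel, fgdModel, iterCount, cv2.GC_INIT_WITH_MASK).
```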
S5, based on the feature extraction network, the dynamic adjustment module and the target prediction module, a target tracking model is constructed taking the video sample frame as input and taking as output the position parameter and area parameter of the preset target object in the frame following the feature map output in real time by the feature extraction network. In one embodiment, the area of the preset target object is represented by the number of labels belonging to the preset target object in the query frame, and its position is represented by the centroid of the preset target object in the query frame.
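Under the embodiment just described, the two output parameters can be computed from the predicted label mask as follows (the boolean-mask layout and function name are assumptions for illustration):

```python
import numpy as np

def target_parameters(label_mask):
    """Area = count of pixels labelled as the target; position = their centroid."""
    ys, xs = np.nonzero(label_mask)
    area = ys.size
    if area == 0:
        return 0, None  # target absent from the query frame
    return area, (float(ys.mean()), float(xs.mean()))
```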
In one embodiment, the preset value of the rate of change of the number of tags is 10%.
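Combining this 10% preset with the short-term frame-selection rule of step S41, a sketch follows. The frame indexing and the exact pair of frames whose label counts are compared are assumptions.

```python
def select_short_term_frames(t, label_counts, threshold=0.10):
    """Choose short-term reference frame indices for query frame t.

    label_counts[i] is the number of target labels in frame i; the change
    rate is taken between the two most recently completed frames.
    """
    prev, prev2 = label_counts[t - 1], label_counts[t - 2]
    rate = abs(prev - prev2) / max(prev2, 1)
    if rate >= threshold:
        return [t - 1, t - 2, t - 3]  # fast change: dense, adjacent frames
    return [t - 1, t - 3, t - 5]      # slow change: sparser sampling
```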
The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.
Claims (7)
1. An unsupervised target dense tracking method based on video coloring, characterized in that a target tracking model is obtained according to the following steps S1-S5, and the target tracking model is then applied to complete the tracking of a preset target object;
s1, obtaining video sample frames which are arranged in a time sequence and respectively comprise a preset target object;
s2, constructing a feature extraction network by taking the video sample frame as input and the feature map corresponding to the video sample frame as output based on the convolutional neural network and the SRM module, wherein the first period of the feature extraction network for outputting the feature map is T1;
S3, based on the mode in which the feature extraction network outputs feature maps in real time, taking the feature map output in real time by the feature extraction network as real-time input, and taking as output a feature map group formed from a preset number of the obtained feature maps, constructing a dynamic adjustment module, wherein the second period with which the dynamic adjustment module outputs the feature map group is T2, and T2 > T1;
S4, constructing a reference frame group based on the feature graph output by the feature extraction network in real time and the feature graph group output by the dynamic adjustment module, and constructing a target prediction module by taking the reference frame group as input and taking the position parameter and the area parameter of a preset target object in the feature graph of the next frame of the feature graph output by the feature extraction network in real time as output;
and S5, based on the feature extraction network, the dynamic adjustment module and the target prediction module, taking the video sample frame as input, and taking the position parameter and the area parameter of the preset target object in the feature map of the next frame of the feature map output in real time in the feature extraction network as output to construct a target tracking model.
2. The unsupervised target dense tracking method based on video coloring as claimed in claim 1, wherein the feature extraction network, the dynamic adjustment module and the target prediction module are connected in series in sequence; the feature extraction network outputs the feature map to the dynamic adjustment module in real time with the first period T1; the dynamic adjustment module receives the feature map in real time with the first period T1 and outputs, with the second period T2, a feature map group formed from a preset number of the obtained feature maps to the target prediction module; a reference frame group is constructed from the feature map group output by the dynamic adjustment module and used as the input of the target prediction module, the reference frame group including the latest frame feature map output by the feature extraction network and received in real time by the dynamic adjustment module.
3. The unsupervised target dense tracking method based on video coloring as claimed in claim 1, wherein the feature extraction network, the dynamic adjustment module and the target prediction module are connected to each other two by two; the feature extraction network simultaneously outputs the feature map to the dynamic adjustment module and the target prediction module in real time with the first period T1; the dynamic adjustment module receives the feature map in real time with the first period T1 and outputs, with the second period T2, a feature map group formed from a preset number of the obtained feature maps to the target prediction module; the latest frame feature map output by the feature extraction network to the target prediction module in real time with the first period T1, together with the feature map group output by the dynamic adjustment module to the target prediction module with the second period T2, constructs the reference frame group used as the input of the target prediction module.
4. The unsupervised target dense tracking method based on video coloring as claimed in claim 1, wherein in step S2, the SRM module is combined with each residual block of the convolutional neural network to adjust the weight of each channel of the feature map output by the feature extraction network, with the following specific steps:
S21, the convolutional neural network takes a video sample frame as input and outputs a feature map of dimension C × H × W, where C is the number of channels, H the height and W the width of the feature map; the standard deviation, mean, maximum and entropy of each channel of the feature map are calculated to obtain a matrix of dimension C × 4;
S22, a 1 × 1 convolution is applied to the C × 4 matrix to obtain a C × 1 weight vector; the weight of each channel of the feature map is adjusted according to this weight vector, and the adjusted feature map is taken as the output of the feature extraction network.
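Steps S21-S22 can be sketched in NumPy as below. The histogram-based entropy and the sigmoid gating are assumptions the text does not specify, and the learned 1 × 1 convolution over the C × 4 statistics is represented by a length-4 weight vector `w`.

```python
import numpy as np

def channel_stats(feat):
    """Per-channel std, mean, max and entropy of a C x H x W feature map -> C x 4."""
    c = feat.shape[0]
    flat = feat.reshape(c, -1)
    ent = np.empty(c)
    for i in range(c):
        # Entropy of a 16-bin histogram (binning choice is an assumption).
        hist, _ = np.histogram(flat[i], bins=16)
        p = hist / hist.sum()
        p = p[p > 0]
        ent[i] = -(p * np.log(p)).sum()
    return np.stack([flat.std(axis=1), flat.mean(axis=1),
                     flat.max(axis=1), ent], axis=1)

def srm_recalibrate(feat, w):
    """Gate each channel: 1x1 conv over the C x 4 stats, then sigmoid, then rescale."""
    z = channel_stats(feat) @ w                  # C-dimensional pre-activation
    gate = 1.0 / (1.0 + np.exp(-z))              # per-channel weight in (0, 1)
    return feat * gate[:, None, None]
```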
5. The unsupervised object dense tracking method based on video coloring as claimed in claim 1, wherein the reference frame group includes a long-term reference frame group and a short-term reference frame group, and the step S4 specifically includes the following steps:
S41, taking the frame following the feature map output in real time by the feature extraction network as the query frame, with frame number t; each pixel belonging to the preset target object in each feature map corresponds to one label; the label count of the feature map output in real time by the feature extraction network is compared with the label count of the previous frame's feature map; if the change rate is greater than or equal to a preset value, the t-1, t-2 and t-3 frame feature maps are selected to construct the short-term reference frame group; if the change rate of the label count is less than the preset value, the t-1, t-3 and t-5 frame feature maps are selected to construct the short-term reference frame group;
when the short-term reference frame group consists of the t-1, t-2 and t-3 frame feature maps, a preset number of feature maps before the t-3 frame, along the historical time direction, are taken as candidate long-term reference frames;
when the short-term reference frame group consists of the t-1, t-3 and t-5 frame feature maps, a preset number of feature maps before the t-5 frame, along the historical time direction, are taken as candidate long-term reference frames;
S42, taking the pixels at preset positions in the query frame as query pixels, pixels belonging to the preset target object are searched for among the candidate pixels of each candidate long-term reference frame; query pixels sharing the same label are marked, according to the corresponding label, as pixels belonging to the preset target object, and the result obtained in this process is taken as the long-term memory;
constructing a candidate pixel point based on an expansion rate dil, wherein the expansion rate dil is as follows:
in the formula, C_{t-1} is the centroid of the feature map of the frame preceding the query frame, C_r is the centroid of the reference frame, H is the height of the feature map, and W is the width of the feature map; the pixels within the range corresponding to the expansion rate are the candidate pixels;
S43, square areas are divided with the coordinates in each short-term reference frame corresponding to the coordinates of each query pixel in the query frame as centers and 15 pixels as side length; pixels belonging to the preset target object are searched for within each square area; query pixels sharing the same label are marked, according to the corresponding label, as pixels belonging to the preset target object, and the result obtained in this process is taken as the short-term memory;
S44, selecting the reference frame group input to the target prediction module based on the coincidence parameters IoU and ratio of the long-term memory and the short-term memory, where IoU and ratio are as follows:
IoU = |l_m ∩ s_m| / |l_m ∪ s_m|, ratio = |l_m| / |s_m|
in the formula, l_m is the set of pixels marked as belonging to the preset target object in the long-term memory, and s_m is the set of pixels marked as belonging to the preset target object in the short-term memory;
taking candidate long-term reference frames with ratio values larger than 0.9 and smaller than 1.05 as a long-term reference frame group, and constructing the reference frame group according to the following method:
(1) when IoU is greater than 0.9, constructing a reference frame group by using the long-term reference frame group and the t-3 th frame feature map;
(2) when 0.6 ≤ IoU ≤ 0.9, constructing the reference frame group from the long-term reference frame group and the short-term reference frame group;
(3) when IoU <0.6, the long-term reference frame group is reselected according to a preset rule, and the reference frame group is constructed with the reselected long-term reference frame group and the short-term reference frame group.
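The two coincidence parameters of step S44 can be computed from boolean masks as follows. Interpreting ratio as the pixel-count ratio of the two memories is an assumption, consistent with the ideal value of 1.0 and the 0.9 < ratio < 1.05 window used above.

```python
import numpy as np

def coincidence_parameters(l_m, s_m):
    """IoU and ratio between long-term (l_m) and short-term (s_m) target masks.

    Both arguments are boolean arrays marking pixels labelled as the target.
    """
    inter = np.logical_and(l_m, s_m).sum()
    union = np.logical_or(l_m, s_m).sum()
    iou = inter / union if union else 0.0
    ratio = l_m.sum() / max(int(s_m.sum()), 1)
    return float(iou), float(ratio)
```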
6. The video coloring-based unsupervised dense target tracking method according to claim 2, wherein the preset value of the rate of change of the number of tags is 10%.
7. The unsupervised target dense tracking method based on video coloring as claimed in claim 2, wherein after the long-term reference frame group is reselected in step S44, foreground-background segmentation is performed on each long-term reference frame according to the long-term memory and the short-term memory, based on the grabcut method.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111609449.4A CN114399531A (en) | 2021-12-24 | 2021-12-24 | Unsupervised target dense tracking method based on video coloring |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114399531A true CN114399531A (en) | 2022-04-26 |
Family
ID=81226420
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111609449.4A Pending CN114399531A (en) | 2021-12-24 | 2021-12-24 | Unsupervised target dense tracking method based on video coloring |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114399531A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6335985B1 (en) * | 1998-01-07 | 2002-01-01 | Kabushiki Kaisha Toshiba | Object extraction apparatus |
US6774917B1 (en) * | 1999-03-11 | 2004-08-10 | Fuji Xerox Co., Ltd. | Methods and apparatuses for interactive similarity searching, retrieval, and browsing of video |
CN109086709A (en) * | 2018-07-27 | 2018-12-25 | 腾讯科技(深圳)有限公司 | Feature Selection Model training method, device and storage medium |
CN111090756A (en) * | 2020-03-24 | 2020-05-01 | 腾讯科技(深圳)有限公司 | Artificial intelligence-based multi-target recommendation model training method and device |
CN112950675A (en) * | 2021-03-18 | 2021-06-11 | 深圳市商汤科技有限公司 | Target tracking method and device, electronic equipment and storage medium |
CN113344976A (en) * | 2021-06-29 | 2021-09-03 | 常州工学院 | Visual tracking method based on target object characterization point estimation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109241913B (en) | Ship detection method and system combining significance detection and deep learning | |
CN112836640B (en) | Single-camera multi-target pedestrian tracking method | |
CN112150493B (en) | Semantic guidance-based screen area detection method in natural scene | |
CN110781756A (en) | Urban road extraction method and device based on remote sensing image | |
CN112884742B (en) | Multi-target real-time detection, identification and tracking method based on multi-algorithm fusion | |
CN112180375B (en) | Weather radar echo extrapolation method based on improved TrajGRU network | |
CN113011329A (en) | Pyramid network based on multi-scale features and dense crowd counting method | |
CN113592911B (en) | Apparent enhanced depth target tracking method | |
CN109993052B (en) | Scale-adaptive target tracking method and system under complex scene | |
CN112560619B (en) | Multi-focus image fusion-based multi-distance bird accurate identification method | |
CN111242026B (en) | Remote sensing image target detection method based on spatial hierarchy perception module and metric learning | |
CN115937254B (en) | Multi-aerial flying target tracking method and system based on semi-supervised learning | |
CN111383250A (en) | Moving target detection method and device based on improved Gaussian mixture model | |
CN111161309A (en) | Searching and positioning method for vehicle-mounted video dynamic target | |
CN113052170A (en) | Small target license plate recognition method under unconstrained scene | |
CN113486894A (en) | Semantic segmentation method for satellite image feature component | |
CN117036397A (en) | Multi-target tracking method based on fusion information association and camera motion compensation | |
CN114973071A (en) | Unsupervised video target segmentation method and system based on long-term and short-term time sequence characteristics | |
CN110276782B (en) | Hyperspectral target tracking method combining spatial spectral features and related filtering | |
CN113627481A (en) | Multi-model combined unmanned aerial vehicle garbage classification method for smart gardens | |
CN111161323B (en) | Complex scene target tracking method and system based on correlation filtering | |
CN116363557B (en) | Self-learning labeling method, system and medium for continuous frames | |
CN117011655A (en) | Adaptive region selection feature fusion based method, target tracking method and system | |
CN115861384A (en) | Optical flow estimation method and system based on generation of countermeasure and attention mechanism | |
CN114399531A (en) | Unsupervised target dense tracking method based on video coloring |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |