CN115393400A - Video target tracking method for single sample learning


Info

Publication number: CN115393400A
Application number: CN202211108906.6A
Authority: CN (China)
Prior art keywords: segmentation, target, tracking, image, module
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 曹东
Applicant / Current assignee: Wuxi Dongru Technology Co., Ltd.

Classifications

    • G06T 7/246 (Image analysis; analysis of motion using feature-based methods, e.g. the tracking of corners or segments)
    • G06N 3/088 (Neural networks; learning methods; non-supervised learning, e.g. competitive learning)
    • G06V 10/26 (Image preprocessing; segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion)
    • G06V 10/764 (Pattern recognition or machine learning; classification, e.g. of video objects)
    • G06V 10/774 (Processing image or video features in feature spaces; generating sets of training patterns; bootstrap methods, e.g. bagging or boosting)
    • G06V 10/80 (Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level)
    • G06V 20/40 (Scenes; scene-specific elements in video content)
    • G06T 2207/10016 (Indexing scheme for image analysis or image enhancement; image acquisition modality: video; image sequence)


Abstract

The invention discloses a video target tracking method based on single-sample learning. The method comprises: constructing a semantic segmentation data set of video image target objects and defining an image semantic segmentation model; constructing an image frame sequence perception feature extraction module, a single-sample learning information extraction module and a segmentation tracking module; and inferring the image segmentation of subsequent frames. The optimal segmentation encoding output by the segmentation tracking module is used as the input of the segmentation decoder, whose final output is a multi-channel target semantic segmentation result. The method combines pre-trained vectorized representations with target-specific model parameter learning designed into the inference process, generalizes the appearance information of the target object better and achieves state-of-the-art dynamic target tracking, thereby enabling optimal autonomous action planning for intelligent robots.

Description

Video target tracking method for single sample learning
Technical Field
The invention relates to the field of intelligent manufacturing and machine vision, in particular to a video target tracking method for single-sample learning.
Background
Video target tracking with single-sample learning is a method for dynamic target tracking. In the intelligent manufacturing scenario of a digital factory, an intelligent robot must track a moving target object while executing a given production task and planning its actions autonomously, so that downstream real-time precise control can be achieved. Target object tracking based on video streams is therefore one of the key technologies for realizing such applications effectively. Video target tracking can be achieved by performing frame-by-frame semantic segmentation of the target object over the video sequence, thereby separating the target foreground from the background and enabling the autonomous action planning of the intelligent robot.
Existing video target tracking methods for single-sample learning have notable drawbacks. Prior work adapts a semantic segmentation network to the video target tracking task through online fine-tuning; however, this easily causes overfitting to the target appearance defined by the first frame and introduces high latency due to excessive computational complexity. Subsequent methods integrate a target-specific appearance model into the segmentation model, improving runtime and enabling end-to-end learning, but the feature-matching techniques based on feature embedding that they rely on to generalize to new target objects are often difficult to realize effectively with deep learning.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects of the prior art, the invention provides a video target tracking method based on single-sample learning. The method combines pre-trained vectorized representations with target-specific model parameter learning designed into the inference process, generalizes the appearance information of the target object better and achieves state-of-the-art dynamic target tracking, thereby enabling optimal autonomous action planning for intelligent robots and effectively solving the problems described in the background section.
(II) technical scheme
To achieve the above purpose, the invention adopts the following technical solution. A video target tracking method based on single-sample learning comprises the following operation steps:
S1: model construction: construct a semantic segmentation data set of video image target objects and define an image semantic segmentation model;
S2: module construction: construct an image frame sequence perception feature extraction module;
S3: sample module construction: construct the single-sample learning information extraction module;
S4: tracking module construction: construct the segmentation tracking module;
S5: image segmentation inference: based on the single-sample learning information extraction module and the segmentation tracking module applied to the first frame image, infer the segmentation of subsequent frame images;
S6: result output: the multi-channel target segmentation unit takes the optimal segmentation encoding output by the segmentation tracking module as input, and the final output of the segmentation decoder is a multi-channel target semantic segmentation result.
As a preferred technical solution of the present application, step S1 specifically includes the following steps:
A1: construct a semi-supervised image segmentation data set;
A2: in the image segmentation data set of the video target tracking sequence, the target object is defined only by the reference foreground/background segmentation labels given in the first frame; for a specific video sequence, typically only the semantic segmentation label of the first frame image is given, and the target must then be inferred by segmentation in each subsequent frame;
A3: based on the above data set characteristics, define the video object segmentation framework as a parameterized model whose learnable parameters are obtained through learning during model training;
A4: realize video target tracking as an end-to-end network by means of video image segmentation.
As a preferred technical solution of the present application, step S2 specifically includes the following steps:
B1: the image frame sequence perception feature extraction module comprises a pixel-associated information aggregation unit, a pixel classifier and an attention mechanism unit;
B2: construct the pixel classifier, whose input is the single-sample true-value label Yseg_1; by encoding the input target true-value label Yseg_1, it predicts the segmentation true-value labels of the other input images of the single-sample learning information extraction module;
B3: by applying inference within the video sequence, new image labels for subsequent frames are annotated automatically, yielding feature/segmentation-label pairs (x_t, Yseg_t) for additional frames and thereby expanding the small-sample learning data set.
As a preferred technical solution of the present application, step S3 specifically includes the following steps:
C1: solve by applying steepest-descent iterations, striking a compromise between accuracy and efficiency;
C2: perform all computations with standard neural network operations; starting from a given initialization, execute N steepest-descent iterations. Because steepest descent converges quickly and efficiently, setting the number of iterations to N = 5 during training and inference achieves the expected convergence;
C3: for subsequent update optimizations, setting the number of iterations to N = 2 already achieves a very satisfactory update while keeping the computational cost minimal, so real-time processing is possible;
C4: the constructed single-sample learning information extraction module can be applied to the subsequently input test frame sequence; combined with the real-time parameter updates of the previous step, it is applied downstream to predict the parameters of the segmentation tracking module, and the resulting segmentation encoding of the video tracking target is finally provided as input to the segmentation decoder.
As a preferred technical solution of the present application, step S4 specifically includes the following steps:
D1: the depth feature map of the segmentation tracking module supports real-time, lightweight operation and predicts rich encoding information of the target segmentation;
D2: the model parameters of the segmentation tracking module are obtained by the single-sample learning information extraction module, by minimizing the squared error between the output of the segmentation tracking module and the generated true-value label, weighted by the per-pixel importance weights provided by the attention mechanism unit;
D3: the segmentation tracking module is realized as a convolutional filter with kernel size K = 5; its first intermediate hidden layer adopts dilated (hole) convolution with dilation rate D_r = 2, so that a wider range of inter-pixel mutual information can be extracted at minimal computational cost.
As a preferred technical solution of the present application, step S5 specifically includes the following operation steps:
E1: based on the first frame image of the video sequence, the segmentation tracking module parameters are computed for the initial input image Ima_1 together with its given true-value label Yseg_1;
E2: the image annotation pair (x_1, Yseg_1) of the video sequence constitutes the training sample for learning to segment the given target;
E3: predicting the segmentation tracking module parameters as in the previous step, by directly minimizing the segmentation error in the first frame, ensures robust segmentation prediction for the upcoming frames;
E4: use the first-frame true-value label Yseg_1 as the label inside our single-sample learning information extraction module;
E5: encode the true-value label Yseg_1 so that the segmentation tracking module can predict a richer representation of the target segmentation in the test frames;
E6: this guides the single-sample learning information extraction module to optimize and converge as quickly as possible, and realizes real-time optimal segmentation encoding output from the segmentation tracking module.
As a preferred technical solution of the present application, step S6 specifically includes the following operation steps:
F1: the input of the multi-channel target segmentation unit includes three information streams: first, the real-time optimal segmentation encoding output by the segmentation tracking module; second, the output of the single-sample learning information extraction module; third, the output of the image frame sequence perception feature extraction module;
F2: first construct a memory bank whose input information comes from two branches: the output of the single-sample learning information extraction module, and the output of the image frame sequence perception feature extraction module;
F3: the memory bank stores semantic information through dynamically parameterized regional semantic comparison and pixel semantic aggregation, and is composed of one feature entry per target;
F4: the input information stream of the target segmentation decoding module comprises two branches: the real-time optimal segmentation encoding output by the segmentation tracking module, and the embedded representation output by the memory bank;
F5: the target segmentation decoding module expresses all class feature representations derived from the memory bank in real tensor form;
F6: furthermore, the global context of preceding and following frame sequences between images is captured, enriching the expressive power of semantic understanding;
F7: the memory bank retains a compressed, globally representative feature representation during the inference stage and outputs the video target tracking result through feature information fusion.
As a preferred technical solution of the present application, steps S1 to S6 combine pre-trained vectorized representations with target-specific model parameter learning designed into the inference process, so as to generalize the appearance information of the target object and achieve state-of-the-art dynamic target tracking.
(III) advantageous effects
Compared with the prior art, the invention provides a video target tracking method based on single-sample learning with the following beneficial effects. The method combines pre-trained vectorized representations with target-specific model parameter learning designed into the inference process, so it generalizes the appearance information of the target object better and achieves state-of-the-art dynamic target tracking, which in turn enables optimal autonomous action planning for intelligent robots. Video target tracking refers to locating a moving target object throughout a video sequence, covering both single-target and multi-target tracking; the associated difficulties include occlusion and motion blur, viewpoint and scale changes, background and illumination changes, and the target object moving out of the field of view during tracking. Small-sample learning is a machine learning approach that performs classification or regression with very little labeled data combined with a large amount of unlabeled data; mainstream deep learning methods, by contrast, can only train models to an acceptable generalization performance with the help of large labeled data sets. Small-sample learning therefore effectively addresses the lack of labeled data sets in real production scenarios, markedly reduces the labor and time cost of data annotation, and markedly improves the robustness and generalization of the algorithmic system. Compared with other existing single-sample learning methods of the same kind, the target information extracted in the training and learning stage is richer here: key association information between preceding and following frames of the video sequence, and contextual semantic information between distant and nearby pixels of an image, can be memorized effectively, which improves the accuracy of multi-target construction. Furthermore, the computation time of this single-sample learning method in the model inference stage is much lower than that of other existing models, improving the real-time performance of algorithmic inference.
1. The method successfully realizes single-sample learning video target tracking, with performance superior to other existing methods of the same kind. The segmentation tracking module learns, from the first frame, an initial segmentation true-value label for predicting the target object. Because the true-value label is refined by the network, the network constructed by the invention has a strong capability of learning segmentation priors, and it is not limited to operating on approximate target annotation labels when performing conditional segmentation of the target object. Yseg_1 serves as the label inside the single-sample learning information extraction module, and a trainable pixel classifier is introduced and realized through automatic machine learning inside that module, instead of directly and simply using the first-frame true-value annotation. The segmentation tracking module can therefore predict multi-channel segmentation, providing strongly associated target perception information and making the segmentation prediction more accurate.
2. STM realizes video object segmentation with a space-time memory network, providing a novel solution for semi-supervised video object segmentation that turns video frames with object masks into richer intermediate predictions according to the nature of the problem. FEELVOS proposes fast end-to-end embedding learning for video object segmentation. PreMVOS first generates an accurate set of object segmentation mask proposals for each video frame, then selects and merges these proposals into accurate and temporally consistent pixel-level object tracks across the video sequence, specifically addressing the segmentation of multiple objects in a video sequence. FRTM proposes a VOS architecture consisting of two network components; its target appearance model is a lightweight module that learns in the inference phase with fast optimization techniques and predicts a coarse but robust object segmentation. SiamRCNN, combined with a dynamic-programming-based algorithm, models the complete history of the object to be tracked and of potentially interfering objects using re-detection of the first-frame template and the previous-frame prediction. AGSSVOS segments all object instances simultaneously in one feed-forward path through instance-agnostic and instance-specific modules, with the information of both modules passed through an attention-guided fusion decoder. Compared with these methods, the present method achieves the best performance: relative to the best existing method, the average region similarity (Jaccard, J) score increases by 2.9 and the boundary (F) score by 2.5, and the overall score improves by 3.4% over the best previous method STM. The test results demonstrate that the new method performs best among video target tracking algorithms.
3. In the algorithmic inference process, given a test sequence together with the first-frame label, an initial training set containing the single sample pair is first created for the single-sample learning information extraction module from the feature map extracted from the first frame. The single-sample learning information extraction module then predicts the parameters of the segmentation tracking module through iterative learning; the initial estimate of the segmentation tracking module is set to all zeros to reduce computational complexity and improve the real-time performance of the system. The learned model is then applied to the subsequent test frame Ima_2 to obtain its label encoding. To adapt to the scene, the segmentation tracking module is further updated with information from processed frames based on a memory-bank approach; the memory is kept current by deleting stale samples while the first frame is always retained. If a video sequence contains multiple targets, each target is processed independently in parallel; the complete tracking history of objects and of potential distractors is modeled using re-detection of the first-frame template and the previous-frame prediction, which realizes an optimal tracking decision and improves re-detection of the target object after long occlusions. Experiments verify that the performance of the algorithm is optimal.
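For illustration, the inference flow just described can be sketched in Python as follows. The helper objects (extract_features, learner, tracker, decoder, memory) are hypothetical stand-ins for the modules of Fig. 1, and the schedule only mirrors the description above (more optimization steps on the first frame, fewer on later updates, stale samples dropped, first frame always kept); it is a sketch, not the patent's implementation.

```python
def track_video(frames, first_mask, extract_features, learner, tracker, decoder, memory,
                n_init_steps=5, n_update_steps=2, max_memory=20):
    """Sketch of the described inference flow: learn target-specific tracker weights
    from the annotated first frame, segment each subsequent frame, and refresh the
    single-sample learner with newly processed frames."""
    feats0 = extract_features(frames[0])                 # frame-sequence perception features
    samples = [(feats0, first_mask)]                     # the single annotated sample
    weights = learner.fit(samples, steps=n_init_steps)   # predict segmentation tracking parameters

    masks = [first_mask]
    for frame in frames[1:]:
        feats = extract_features(frame)
        encoding = tracker(feats, weights)               # real-time optimal segmentation encoding
        mask = decoder(encoding, memory.read())          # multi-channel segmentation result
        masks.append(mask)

        memory.write(feats, mask)                        # memory-bank update with frame information
        samples.append((feats, mask))
        if len(samples) > max_memory:                    # delete stale samples, always keep frame 1
            samples = [samples[0]] + samples[-(max_memory - 1):]
        weights = learner.fit(samples, steps=n_update_steps, init=weights)
    return masks
```

When a sequence contains multiple targets, the same loop would be run independently and in parallel for each target, as stated above.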
The whole single-sample learning video target tracking method is simple and convenient to operate, and its effect in use is better than that of traditional approaches.
Drawings
Fig. 1 is a schematic diagram of an overall algorithm flow of a single-sample learning video target tracking method according to the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely below with reference to the accompanying drawing and the detailed embodiments. Those skilled in the art will understand that the embodiments described below are only some, not all, of the embodiments of the present invention; they are used only to illustrate the invention and should not be construed as limiting its scope. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present invention. Examples for which specific conditions are not specified were carried out under conventional conditions or under conditions recommended by the manufacturer. Instruments whose manufacturer is not indicated are conventional, commercially available products.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
As shown in fig. 1, a single-sample learning video target tracking method includes the following operation steps:
S1: model construction: construct a semantic segmentation data set of video image target objects and define an image semantic segmentation model;
S2: module construction: construct an image frame sequence perception feature extraction module;
S3: sample module construction: construct the single-sample learning information extraction module;
S4: tracking module construction: construct the segmentation tracking module;
S5: image segmentation inference: based on the single-sample learning information extraction module and the segmentation tracking module applied to the first frame image, infer the segmentation of subsequent frame images;
S6: result output: the multi-channel target segmentation unit takes the optimal segmentation encoding output by the segmentation tracking module as input, and the final output of the segmentation decoder is a multi-channel target semantic segmentation result.
Further, step S1 specifically includes the following steps:
A1: construct a semi-supervised image segmentation data set;
A2: in the image segmentation data set of the video target tracking sequence, the target object is defined only by the reference foreground/background segmentation labels given in the first frame; typically only the semantic segmentation label of the first frame image of a specific video sequence is given, after which the target must be inferred by segmentation in each subsequent frame;
A3: based on the above data set characteristics, define the video object segmentation framework as a parameterized model whose learnable parameters are obtained through learning during model training;
A4: realize video target tracking as an end-to-end network by means of video image segmentation.
Further, step S2 specifically includes the following operation steps:
B1: the image frame sequence perception feature extraction module comprises a pixel-associated information aggregation unit, a pixel classifier and an attention mechanism unit;
B2: construct the pixel classifier, whose input is the single-sample true-value label Yseg_1; by encoding the input target true-value label Yseg_1, it predicts the segmentation true-value labels of the other input images of the single-sample learning information extraction module;
B3: by applying inference within the video sequence, new image labels for subsequent frames are annotated automatically, yielding feature/segmentation-label pairs (x_t, Yseg_t) for additional frames and thereby expanding the small-sample learning data set.
Further, step S3 specifically includes the following operation steps:
C1: solve by applying steepest-descent iterations, striking a compromise between accuracy and efficiency;
C2: perform all computations with standard neural network operations; starting from a given initialization, execute N steepest-descent iterations. Because steepest descent converges quickly and efficiently, setting the number of iterations to N = 5 during training and inference achieves the expected convergence;
C3: for subsequent update optimizations, setting the number of iterations to N = 2 already achieves a very satisfactory update while keeping the computational cost minimal, so real-time processing is possible;
C4: the constructed single-sample learning information extraction module can be applied to the subsequently input test frame sequence; combined with the real-time parameter updates of the previous step, it is applied downstream to predict the parameters of the segmentation tracking module, and the resulting segmentation encoding of the video tracking target is finally provided as input to the segmentation decoder.
Further, step S4 specifically includes the following operation steps:
D1: the depth feature map of the segmentation tracking module supports real-time, lightweight operation and predicts rich encoding information of the target segmentation;
D2: the model parameters of the segmentation tracking module are obtained by the single-sample learning information extraction module, by minimizing the squared error between the output of the segmentation tracking module and the generated true-value label, weighted by the per-pixel importance weights provided by the attention mechanism unit;
D3: the segmentation tracking module is realized as a convolutional filter with kernel size K = 5; its first intermediate hidden layer adopts dilated (hole) convolution with dilation rate D_r = 2, so that a wider range of inter-pixel mutual information can be extracted at minimal computational cost.
Further, step S5 specifically includes the following steps:
E1: based on the first frame image of the video sequence, the segmentation tracking module parameters are computed for the initial input image Ima_1 together with its given true-value label Yseg_1;
E2: the image annotation pair (x_1, Yseg_1) of the video sequence constitutes the training sample for learning to segment the given target;
E3: predicting the segmentation tracking module parameters as in the previous step, by directly minimizing the segmentation error in the first frame, ensures robust segmentation prediction for the upcoming frames;
E4: use the first-frame true-value label Yseg_1 as the label inside our single-sample learning information extraction module;
E5: encode the true-value label Yseg_1 so that the segmentation tracking module can predict a richer representation of the target segmentation in the test frames;
E6: this guides the single-sample learning information extraction module to optimize and converge as quickly as possible, and realizes real-time optimal segmentation encoding output from the segmentation tracking module.
Further, step S6 specifically includes the following steps:
F1: the input of the multi-channel target segmentation unit includes three information streams: first, the real-time optimal segmentation encoding output by the segmentation tracking module; second, the output of the single-sample learning information extraction module; third, the output of the image frame sequence perception feature extraction module;
F2: first construct a memory bank whose input information comes from two branches: the output of the single-sample learning information extraction module, and the output of the image frame sequence perception feature extraction module;
F3: the memory bank stores semantic information through dynamically parameterized regional semantic comparison and pixel semantic aggregation, and is composed of one feature entry per target;
F4: the input information stream of the target segmentation decoding module comprises two branches: the real-time optimal segmentation encoding output by the segmentation tracking module, and the embedded representation output by the memory bank;
F5: the target segmentation decoding module expresses all class feature representations derived from the memory bank in real tensor form;
F6: furthermore, the global context of preceding and following frame sequences between images is captured, enriching the expressive power of semantic understanding;
F7: the memory bank retains a compressed, globally representative feature representation during the inference stage and outputs the video target tracking result through feature information fusion.
Furthermore, steps S1 to S6 combine pre-trained vectorized representations with target-specific model parameter learning designed into the inference process, so as to generalize the appearance information of the target object and achieve state-of-the-art dynamic target tracking.
Performance was compared with other state-of-the-art methods of the same kind (see the table below). The experimental data set is a video target tracking data set for intelligent manufacturing production scenarios constructed by the inventors, comprising 50 video segments. The evaluation indices include the average region similarity (Jaccard index, J), the boundary score (F) and the overall score. The compared algorithms include STM, FEELVOS, PreMVOS, FRTM, SiamRCNN and AGSSVOS.
[Table of comparative results omitted: it is provided only as an image in the original publication.]
The embodiment is as follows.
The overall method comprises the following implementation steps:
1. Construct a semantic segmentation data set of video image target objects and define an image semantic segmentation model.
2. Construct the image frame sequence perception feature extraction module.
3. Construct the single-sample learning information extraction module.
4. Construct the segmentation tracking module.
5. Based on the single-sample learning information extraction module and the segmentation tracking module applied to the first frame image, infer the segmentation of subsequent frame images.
6. The multi-channel target segmentation unit takes the optimal segmentation encoding output by the segmentation tracking module as input; the final output of the segmentation decoder is a multi-channel target semantic segmentation result.
Overall step 1: construction of the video image target object semantic segmentation data set and definition of the image semantic segmentation model.
Step one: construct a semi-supervised image segmentation data set. A typical video data set consists of a series of image frames in chronological order: a video is a time-ordered sequence of frame images, and a set of annotated segmentation labels corresponds to (a subset of) these frames, distinguishing the foreground target objects from the image background. The video image segmentation data set therefore comprises a set of image-label pairs and, in addition, unsupervised (unlabeled) image frames.
Step two: in the image segmentation data set of the video target tracking sequence, the target object is defined only by the reference foreground/background segmentation labels given in the first frame. This setting belongs to small-sample learning: only the first frame image of a specific video sequence is given a semantic segmentation label, and the target must then be segmented by inference in each subsequent frame. The video sequence is represented in chronological order as a frame sequence whose most recent frame is the current image. In this application scenario only the single first image frame Ima_1 carries true-value annotation data Yseg_1; all other frames are unsupervised data. The data set of the invention therefore finally consists of one annotated first frame together with the unlabeled subsequent frames.
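Purely as an illustration of this data layout, a sequence of the data set described above can be held in a structure like the following; the class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

import numpy as np


@dataclass
class VOSSequence:
    """One sequence of the semi-supervised segmentation data set: chronologically
    ordered frames with a true-value segmentation label only for the first frame;
    every later frame is unsupervised until inference fills in pseudo-labels."""
    frames: List[np.ndarray]              # Ima_1 ... Ima_N, each H x W x 3
    first_frame_label: np.ndarray         # Yseg_1, H x W foreground/background mask
    labels: List[Optional[np.ndarray]] = field(default_factory=list)

    def __post_init__(self):
        if not self.labels:               # only the first frame is annotated
            self.labels = [self.first_frame_label] + [None] * (len(self.frames) - 1)
```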
Step three: based on the above data set characteristics, the video target segmentation framework is defined as a network S with learnable parameters that are obtained through learning during model training. The network takes as input the current image Ima together with the output of the segmentation tracking module. Although the network itself is not specific to the target, it is conditioned on the target: information about the target object is integrated by encoding it into the parameters w of the segmentation tracking module, where w is a convolution kernel forming the weights of a convolutional layer with kernel size K. In this way the goal of integrating the target-object information into, and encoding it by, the parameters w is achieved.
Step four: video target tracking is realized as an end-to-end network by means of video image segmentation. As shown in Fig. 1, it is composed of the image frame sequence perception feature extraction module, the single-sample learning information extraction module, the segmentation tracking module and the multi-channel target segmentation unit. The symbol θ denotes the network parameters learned during offline training, while w denotes the parameters predicted by the single-sample learning information extraction module during inference.
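A skeleton of this composition is sketched below under the assumption that each module is an independent sub-network; the class and argument names are illustrative only. The offline-trained parameters θ live inside the sub-modules, while the target-specific weights w are supplied at inference time by the single-sample learning information extraction module.

```python
import torch.nn as nn


class SingleShotVOSNet(nn.Module):
    """Skeleton of the end-to-end network of Fig. 1: feature extractor,
    single-sample learner (predicts target weights w), segmentation tracking
    module conditioned on w, and multi-channel target segmentation unit."""

    def __init__(self, feature_extractor, few_shot_learner, tracking_head, segmentation_decoder):
        super().__init__()
        self.features = feature_extractor       # offline-learned parameters (theta)
        self.learner = few_shot_learner         # predicts w during inference
        self.tracker = tracking_head            # applies the predicted kernel w
        self.decoder = segmentation_decoder     # outputs the multi-channel segmentation

    def forward(self, frame, target_weights, memory_state=None):
        x = self.features(frame)                        # pixel-associated aggregated features
        encoding = self.tracker(x, target_weights)      # target-conditioned segmentation encoding
        return self.decoder(encoding, x, memory_state)  # multi-channel semantic segmentation
```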
Overall step 2: construction of the image frame sequence perception feature extraction module.
Step one: the image frame sequence perception feature extraction module comprises a pixel-associated information aggregation unit, a pixel classifier and an attention mechanism unit. First, the pixel-associated information aggregation unit is constructed; NASN is adopted as its backbone network, whose structure is obtained with a neural architecture search framework. A time-series image frame Ima is input to the unit, and the resulting pixel-associated aggregated feature x is output in three directions: first, to the memory bank of the multi-channel target segmentation unit; second, to the segmentation tracking module; third, to the single-sample learning information extraction module (see Fig. 1 and the subsequent steps). The multi-channel target segmentation unit uses three residual blocks with spatial stride s = 16. These features first pass through an additional convolutional layer, and before they are input to the subsequent module their channel dimension is reduced to C = 512.
Step two: construct the pixel classifier. Its input is the single-sample true-value label Yseg_1; by encoding the input target true-value label Yseg_1, it predicts the segmentation true-value labels of the other input images of the single-sample learning information extraction module. The pixel classifier maps the label tensor (at image resolution) to a depth feature tensor of size H x W x D, where H, W and D are the height, width and feature dimension of the segmentation tracking module features and s is the feature stride between the image and the features. The pixel classifier is implemented as a dilated (hole) convolutional network.
Step three: the attention mechanism unit takes the target true-value label Yseg as input; the constructed unit maps it to a tensor whose element values are generated according to a convergence function. The training set here is a set of Q feature/segmentation-label pairs (x_t, Yseg_t); it contains the single frame with a true-value segmentation label, (x_1, Yseg_1), and, by applying inference within the video sequence, new image labels for subsequent frames are annotated automatically, yielding pairs (x_t, Yseg_t) for additional frames and thereby expanding the small-sample learning data set. The scalar λ is a regularization parameter obtained by learning. The attention mechanism unit is implemented as a convolutional network with 6 hidden layers; like the pixel classifier, it shares the feature-mapping tensor construction unit.
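The two label-driven units can be sketched as small convolutional networks as below. Only the facts stated above are taken from the text (the label encoder is a dilated convolutional network producing features at the tracking-module resolution; the attention unit has 6 hidden layers and outputs per-pixel importance weights); all channel widths and the exact stride arrangement are assumptions of this sketch.

```python
import torch.nn as nn


class LabelEncoder(nn.Module):
    """Sketch of the pixel classifier: encodes the true-value label Yseg_1 into a
    D-channel representation at the tracking-module feature resolution, using a
    small dilated (hole) convolutional network."""
    def __init__(self, out_channels=32, stride=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=stride // 4, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=4, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(32, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, label_mask):           # label_mask: B x 1 x H x W
        return self.net(label_mask)          # B x D x H/s x W/s


class ImportanceWeightNet(nn.Module):
    """Sketch of the attention mechanism unit: maps the label to non-negative
    per-pixel importance weights used to weight the squared segmentation error
    (higher weight on the target region, lower weight on ambiguous regions)."""
    def __init__(self, hidden=16, stride=16):
        super().__init__()
        layers = [nn.Conv2d(1, hidden, 3, stride=stride // 4, padding=1), nn.ReLU()]
        for _ in range(5):                    # 6 hidden convolutional layers in total
            layers += [nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(hidden, 1, 3, stride=4, padding=1), nn.Softplus()]
        self.net = nn.Sequential(*layers)

    def forward(self, label_mask):
        return self.net(label_mask)           # B x 1 x H/s x W/s
```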
Overall step 3: construction of the single-sample learning information extraction module.
Step one: the constructed single-sample learning information extraction module is continuously differentiable: it minimizes a continuously differentiable objective L(w) over the parameters w of the segmentation tracking module, namely the per-pixel weighted squared error between the segmentation tracking module output and the encoded segmentation labels together with the λ-weighted regularization term,

L(w) = Σ_t || v_t · (x_t * w - E(Yseg_t)) ||^2 + λ ||w||^2,

and the minimization is solved by applying steepest-descent iterations so as to strike a compromise between accuracy and efficiency. The optimization iteration is expressed as

w^(i+1) = w^(i) - α^(i) ∇L(w^(i)),

where ∇ denotes the gradient operation, T denotes transpose, * denotes the convolution operation, t indexes the t-th frame, x_t is the pixel-associated aggregation feature of the t-th frame image, Yseg_t is the segmentation label of the t-th frame, E(·) is the label encoding produced by the pixel classifier, v_t are the per-pixel importance weights produced by the attention mechanism unit, α^(i) is the steepest-descent step length, w^(i) is the value obtained at the i-th iteration for the t-th frame image, and w^(i+1) is the value computed at the (i+1)-th iteration. The input of the single-sample learning information extraction module is this set of feature/label pairs; its output w forms the weights of a convolutional layer with kernel size K and integrates the information about the target object. w is the parameter predicted by the single-sample learning information extraction module during inference.
Step two: all computations of this optimization iteration are realized with standard neural network operations. Because everything is continuously differentiable, the segmentation tracking module parameters w^(i) obtained after i iterations are differentiable with respect to all network parameters θ, and the single-sample learning information extraction module is realized as a network module. As mentioned above, its training set is a set of Q feature/segmentation-label pairs (x_t, Yseg_t); starting from a given initialization w^(0), N steepest-descent iterations are executed (as described in step one). Since steepest descent converges quickly and efficiently, setting the number of iterations to N = 5 during training and inference achieves the expected convergence.
Step three: on the basis of the parameters w obtained in the previous step, the optimization-based formulation above is further used to update the segmentation tracking module parameters in a timely manner as new subsequent frame samples arrive: each newly input frame image is added to the data set and the iterative optimization is applied again. Setting the number of these update iterations to N = 2 already achieves a very satisfactory update while keeping the computational cost minimal, so real-time processing is possible.
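The few-step optimization just described can be sketched as follows, under the assumption that the objective is the attention-weighted squared error between the tracking-module output and the encoded label plus the λ-weighted regularizer. Plain gradient steps with a fixed step size stand in for the steepest-descent step-length rule, which appears only as an image in the original; encode_label and importance are hypothetical stand-ins for the pixel classifier and the attention mechanism unit.

```python
import torch
import torch.nn.functional as F


def optimize_target_weights(samples, init_weights, encode_label, importance,
                            steps=5, step_size=0.1, reg_lambda=0.01, kernel_size=5):
    """samples: list of (x_t, Yseg_t); init_weights: conv kernel w of shape D x C x K x K.
    Runs a few iterations of w <- w - alpha * grad L(w)."""
    w = init_weights.clone().requires_grad_(True)
    for _ in range(steps):
        loss = reg_lambda * (w ** 2).sum()
        for x_t, y_t in samples:
            pred = F.conv2d(x_t, w, padding=kernel_size // 2)  # tracking-module output x_t * w
            target = encode_label(y_t)                         # label encoding E(Yseg_t)
            v_t = importance(y_t)                              # per-pixel importance weights
            loss = loss + (v_t * (pred - target) ** 2).sum()
        (grad,) = torch.autograd.grad(loss, w)
        w = (w - step_size * grad).detach().requires_grad_(True)
    return w.detach()
```

In the flow above, such a routine would be called with steps=5 when processing the first frame and with steps=2 (initialized from the current weights) whenever new frames are added to the sample set.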
Step four: the segmentation tracking module parameters are first obtained from the first frame by steepest-descent iterative prediction. The single-sample learning information extraction module constructed in this way can then be applied to the subsequently input test frame sequence and, combined with the real-time parameter updates of the previous step, is applied downstream to predict the parameters of the segmentation tracking module; the resulting segmentation encoding of the video tracking target is finally provided as input to the segmentation decoder.
Overall step 4: construction of the segmentation tracking module.
Step one: the constructed segmentation tracking module is realized as a mapping whose parameters are the output w of the single-sample learning information extraction module. Through training, the module maps the input C-dimensional depth features x to a D-dimensional encoding of the target segmentation at the same spatial resolution H x W. To ensure that the mapping is continuously differentiable, the segmentation tracking module is constructed as a convolutional mapping whose kernels, of size K, are given by w. The depth feature map of the segmentation tracking module supports real-time, lightweight operation and predicts rich encoding information of the target segmentation.
Step two: the model parameters of the segmentation tracking module are obtained by the single-sample learning information extraction module, by minimizing the squared error between the output of the segmentation tracking module and the generated true-value label, weighted by the per-pixel importance weights provided by the attention mechanism unit.
Step three: the segmentation tracking module is realized as a convolutional filter with kernel size K = 5; its first intermediate hidden layer adopts dilated (hole) convolution with dilation rate D_r = 2, so that a wider range of inter-pixel information can be extracted at minimal computational cost. The number of output channels D is set to 32.
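A minimal functional sketch of such a head is given below; the hidden channel width and the example tensor shapes are assumptions of the sketch, and in the method described here the kernels w_hidden and w_out would be produced by the single-sample learning information extraction module rather than trained offline.

```python
import torch
import torch.nn.functional as F


def segmentation_tracking_head(x, w_hidden, w_out, kernel_size=5, dilation=2):
    """Two-layer convolutional filter with kernel size K = 5 whose first hidden
    layer uses dilated (hole) convolution with dilation rate 2, mapping the
    C-dimensional features x to a D = 32 channel segmentation encoding at the
    same spatial resolution."""
    pad_hidden = dilation * (kernel_size // 2)                 # keep the spatial size
    hidden = F.relu(F.conv2d(x, w_hidden, padding=pad_hidden, dilation=dilation))
    return F.conv2d(hidden, w_out, padding=kernel_size // 2)


# Example shapes (illustrative): C = 512 input channels, 64 hidden channels, D = 32 output channels.
x = torch.randn(1, 512, 30, 52)
w_hidden = torch.randn(64, 512, 5, 5)
w_out = torch.randn(32, 64, 5, 5)
encoding = segmentation_tracking_head(x, w_hidden, w_out)      # 1 x 32 x 30 x 52
```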
Overall step 5: inference of the segmentation of subsequent frame images, based on the single-sample learning information extraction module and the segmentation tracking module applied to the first frame image.
Step one: based on the first frame image of the video sequence, the segmentation tracking module parameters are computed for the initial input image Ima_1 together with its given true-value label Yseg_1, where the true-value label Yseg_1 defines the feature information of the semantic segmentation of the target object. In other words, the parameters are obtained through the computation of the single-sample learning information extraction module.
Step two: the image annotation pair (x_1, Yseg_1) of the video sequence constitutes the training sample for learning to segment the given target. However, this training sample only becomes available at the inference stage, so the task belongs to small-sample learning in video object segmentation; since only a single first-frame true-value annotation is actually given, it is, strictly speaking, a single-sample learning problem. The single-sample learning information extraction module generates, from the single true-value annotation pair (x_1, Yseg_1), the parameters w required by the segmentation tracking module; this is realized by minimizing the supervised learning objective L(w) defined in overall step 3.
Step three: the pixel-associated information aggregation unit performs the deep embedded feature representation operation on the input image, where the backbone network employs a DenseNet architecture. The segmentation tracking module learns to output the segmentation of the target object in the initial frame. Given a new frame Ima during inference, the target is segmented by applying the segmentation tracking module to the new frame to generate the first new segmentation inference. Because consecutive video frames are strongly correlated, predicting the segmentation tracking module parameters as in the previous step, by directly minimizing the segmentation error in the first frame, ensures a robust segmentation prediction for the upcoming frame.
Step four: the first-frame true-value label Yseg_1 is used as the label inside our single-sample learning information extraction module. Combined with the subsequent unsupervised image data, a trainable label generator is introduced: it takes the true-value segmentation label Yseg as input and predicts the labels of the data set (i.e. the image sequence) used by the single-sample learning information extraction module for learning and training, so that the labels for the module are inferred predictions based on the first-frame annotation as input. In this way the segmentation tracking module parameters are predicted.
Step five: code generation is performed on the ground-truth annotation label Yseg_1, allowing the segmentation tracking module to predict a richer representation of the target segmentation in the test frames.
Step six: to address the imbalance of the training data set, a higher weight is assigned to the target region and a lower weight to blurred regions of the field of view. The attention mechanism unit adjusts the loss value: it takes the ground-truth label Yseg as input and predicts the importance weight of each pixel in the image for the loss. This guides the single-sample learning information extraction module to optimize and converge as fast as possible, and realizes real-time optimal segmentation code output from the segmentation tracking module.
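A minimal sketch of how such an attention weighting unit could look, assuming it is a small convolutional network that maps the label Yseg to positive per-pixel loss weights (higher on the target region, lower elsewhere); the architecture is illustrative, not taken from the patent.

```python
# Assumed attention unit: predicts per-pixel importance weights from the label.
import torch
import torch.nn as nn

class AttentionWeightUnit(nn.Module):
    def __init__(self, label_channels=1, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(label_channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 3, padding=1), nn.Softplus(),  # strictly positive weights
        )

    def forward(self, yseg):
        """yseg: (B, 1, H, W) ground-truth label -> (B, 1, H, W) importance weights."""
        return self.net(yseg)
```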
General step 6: the multi-channel target segmentation unit takes the optimal segmentation code output by the segmentation tracking module as input, and its final output is the multi-channel target semantic segmentation result.
Step one: the input of the multi-channel target segmentation unit comprises three information streams: first, the real-time optimal segmentation code output by the segmentation tracking module; second, the output of the single-sample learning information extraction module; and third, the output of the image frame sequence perception feature extraction module. The multi-channel target segmentation unit consists of two parts, a target segmentation decoding module and a memory bank; the memory bank construction is introduced in steps two and three, and the target segmentation decoding module is introduced in steps four to seven.
Step two: first the memory bank is constructed. Its input comes from two branches: one is the output of the single-sample learning information extraction module, the other is the output of the image frame sequence perception feature extraction module; these information streams contain the pixel-level shallow and deep semantic features of the input image. The role of the memory bank is as follows: the contextual semantics between target object components within an image and between preceding and following frame sequences are essential for understanding the relevance between pixels. The memory bank selectively memorizes the feature information of the input image sequence to realize semantic aggregation of pixels, uses the context information of the time-series image frames it stores to enhance semantic understanding for segmentation, and provides a large number of regional semantic segmentation references.
Step three: the memory bank stores semantic information with dynamic parameters configured for regional semantic comparison and pixel semantic aggregation. The memory bank consists of a set of target feature components, one per category; each entry is a global region-aware representation of one class accumulated over all images Ima observed during the learning phase, stored as a real-valued vector of fixed dimension. At each training stage, the memory bank is updated with the newly learned features: the current depth features of the image Ima are smoothly blended into the stored memory representation, controlled by a memory bank hyper-parameter ε with 0 < ε < 0.06, and an entry is updated only when its category appears in the image Ima with classification confidence greater than a threshold.
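The smooth update can be sketched as follows. The exponential-moving-average form is an assumption; the hyper-parameter range 0 < ε < 0.06 and the confidence gate follow the text, while the default threshold and shapes are illustrative.

```python
# Hedged sketch of the memory bank with confidence-gated smooth updates.
import torch

class MemoryBank:
    def __init__(self, num_classes, feat_dim, eps=0.05, conf_threshold=0.5):
        assert 0.0 < eps < 0.06                        # hyper-parameter range from the text
        self.eps = eps
        self.conf_threshold = conf_threshold
        self.mem = torch.zeros(num_classes, feat_dim)  # one global representation per category

    def update(self, class_feats, class_conf):
        """class_feats: (num_classes, feat_dim) depth features of the current image Ima.
        class_conf: (num_classes,) classification confidence per category."""
        gate = class_conf > self.conf_threshold        # only confidently present classes update
        blended = (1.0 - self.eps) * self.mem + self.eps * class_feats
        self.mem = torch.where(gate.unsqueeze(1), blended, self.mem)
```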
Step four: next, the structure of the target segmentation decoding module is introduced (steps four to seven). Its input information stream comprises two branches: one is the real-time optimal segmentation code output by the segmentation tracking module; the other is the embedded representation output by the memory bank. The target segmentation decoding module compresses the embedded representations from the memory bank into a compact set of representative sample embeddings. For each category, all of its features are passed through a 7-layer convolutional network for multi-way classification, yielding a matrix of fixed-dimension feature representation vectors; multiple feature representations are kept for each class to account for feature variation within the class.
Step five: the target segmentation decoding module collects all class feature representatives derived from the memory bank and expresses them as a real-valued tensor. For each feature map, where W and H denote the width and height of the image, its depth semantics are computed from the embedded representation: for fast computation, the feature tensor and the representative tensor are flattened into matrix form, a similarity matrix is obtained by multiplying one matrix with the transpose of the other, and a hyperbolic tangent is applied; each entry of this matrix reflects a canonical generalized distance measure between one row (a pixel feature) of the flattened feature matrix and one column (a class feature representative) of the representative matrix. Based on these depth semantics and the feature embedding, a feature representation is computed and its tensor structure is adjusted back to the original spatial form.
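The memory read of step five can be sketched as follows, under assumed shapes: feat is the (C, H, W) embedded feature tensor of the current frame and mem is the (K, C) matrix of class feature representatives. The tanh of the matrix product follows the description; the exact normalization and dimensions are assumptions.

```python
# Assumed memory-read operation: similarity to class representatives, then
# a memory-weighted feature representation reshaped back to tensor form.
import torch

def memory_read(feat, mem):
    C, H, W = feat.shape
    flat = feat.reshape(C, H * W).t()        # (H*W, C): one row per pixel feature
    sim = torch.tanh(flat @ mem.t())         # (H*W, K): generalized distance to each class
    read = sim @ mem                         # (H*W, C): memory-weighted representation
    return read.t().reshape(C, H, W)         # adjust tensor structure back to (C, H, W)

# Example usage with illustrative sizes:
feat = torch.randn(64, 32, 32)
mem = torch.randn(10, 64)
out = memory_read(feat, mem)                 # (64, 32, 32)
```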
Step six: the memory-derived feature representation is then concatenated with the original features. The combined representation not only encodes the local context semantics within an image, but also captures the global context of preceding and following frame sequences across images, which enriches the representational power of the semantic understanding.
Step seven: for the multi-channel target pixel classification output by the target segmentation decoding module, the backbone network adopts DenseNet to map the input image Ima into a convolutional representation; classification is realized with a convolutional neural network of hidden-layer depth 9 that generates a class-aware attention map from the feature embedding, using 5x5 convolution kernels. The memory bank stores all region patterns seen in the training data; at the inference stage it retains the compressed global feature representatives, and the video target tracking result is output through feature information fusion.
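A hedged sketch of the decoder head of steps six and seven: the memory-read features are concatenated with the original features and the segmentation code, then classified by a convolutional network with nine 5x5 hidden layers. Channel sizes and the exact fusion are assumptions, not taken from the patent.

```python
# Assumed multi-channel segmentation decoder: feature fusion followed by a
# nine-hidden-layer convolutional classifier with 5x5 kernels.
import torch
import torch.nn as nn

class SegmentationDecoder(nn.Module):
    def __init__(self, feat_ch=64, code_ch=16, num_classes=10, hidden=64, depth=9):
        super().__init__()
        layers, in_ch = [], 2 * feat_ch + code_ch        # original + memory-read + seg code
        for _ in range(depth):                           # nine hidden layers, 5x5 kernels
            layers += [nn.Conv2d(in_ch, hidden, 5, padding=2), nn.ReLU()]
            in_ch = hidden
        layers += [nn.Conv2d(in_ch, num_classes, 1)]     # multi-channel per-pixel classes
        self.net = nn.Sequential(*layers)

    def forward(self, feat, mem_read, seg_code):
        fused = torch.cat([feat, mem_read, seg_code], dim=1)  # feature information fusion
        return self.net(fused)
```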
It is noted that, herein, relational terms such as first and second and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing shows and describes the general principles and features of the present invention, together with its advantages. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described and illustrated in the specification only to explain the principles of the invention, and that various changes and modifications may be made without departing from the spirit and scope of the present invention; such changes and modifications fall within the scope of the invention as claimed.

Claims (8)

1. A video target tracking method for single-sample learning, characterized in that the method comprises the following operation steps:
S1: model construction: constructing a semantic segmentation data set of the video image target object and defining an image semantic segmentation model;
S2: module construction: constructing an image frame sequence perception feature extraction module;
S3: sample module construction: constructing the single-sample learning information extraction module;
S4: tracking module construction: constructing the segmentation tracking module;
S5: image segmentation inference: based on the single-sample learning information extraction module of the first frame image and the segmentation tracking module, inferring the segmentation of the subsequent frame images;
S6: result output: the multi-channel target segmentation unit takes the optimal segmentation code output by the segmentation tracking module as input, and the final output of the segmentation decoder is the multi-channel target semantic segmentation result.
2. The video target tracking method for single-sample learning according to claim 1, characterized in that step S1 specifically comprises the following operation steps:
A1: constructing a semi-supervised image segmentation data set;
A2: in the image segmentation data set of the video target tracking time sequence, the target object is defined only by the reference foreground and background segmentation labels given in the first frame; for a specific video sequence, only the semantic segmentation label of the first frame image is typically given, and the target in each subsequent frame must then be segmented by inference;
A3: based on the above data set characteristics, defining a video object segmentation framework whose learnable parameters are obtained by learning during model training;
A4: realizing video target tracking as an end-to-end network by adopting a video image segmentation method.
3. The video target tracking method for single-sample learning according to claim 1, characterized in that step S2 specifically comprises the following operation steps:
B1: the image frame sequence perception feature extraction module comprises a pixel correlation information convergence unit, a pixel classifier, and an attention mechanism unit;
B2: constructing the pixel classifier, whose input is the single-sample ground-truth annotation label Yseg_1; by encoding the input target ground-truth label Yseg_1, it predicts the segmentation ground-truth labels of the other input images of the single-sample learning information extraction module;
B3: adopting an inference method over the video sequence to automatically label the new image of the next frame, thereby obtaining feature-segmentation label pairs (x_t, Yseg_t) for additional frames and expanding the small-sample learning data set.
4. The video target tracking method for single-sample learning according to claim 1, characterized in that step S3 specifically comprises the following operation steps:
C1: solving by applying steepest-descent iterations, so as to obtain a compromise between precision and efficiency;
C2: performing all computations with standard neural network operations, executing N steepest-descent iterations from a given initialization, where the convergence of steepest descent is fast and efficient and the expected convergence is achieved by setting the iteration number N = 5 during training and inference;
C3: setting N = 2 for the new optimization iterations achieves a highly satisfactory update while reducing the computation to a minimum, so that real-time processing can be realized;
C4: the constructed single-sample learning information extraction module can be applied to the subsequently input time-ordered test frame sequence; combined with the real-time value updates of the previous step, it is applied to the downstream processing stage to predict the segmentation tracking module, and the segmentation code used to obtain the video tracking target is finally provided as input to the segmentation decoder.
5. The video target tracking method for single-sample learning according to claim 1, characterized in that step S4 specifically comprises the following operation steps:
D1: the depth feature map of the segmentation tracking module supports real-time lightweight operation and can predict rich coding information for the target segmentation;
D2: the model parameters of the segmentation tracking module are obtained by the single-sample learning information extraction module through minimizing the squared error between the output of the segmentation tracking module and the generated ground-truth label, weighted by the per-pixel importance weights provided by the attention mechanism unit;
D3: the segmentation tracking module is realized as a convolution filter with kernel size K = 5, in which the first intermediate hidden layer adopts dilated convolution with dilation rate D_r = 2, so that a wider range of pixel mutual information can be extracted with minimal computation.
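For illustration, D3 above can be sketched as the following PyTorch module, with kernel size K = 5 and dilation rate 2 as recited; channel counts and the activation are assumptions.

```python
# Assumed realization of the segmentation tracking filter of D3.
import torch.nn as nn

class SegmentationTrackingFilter(nn.Module):
    def __init__(self, in_ch=64, hidden=32, code_ch=16):
        super().__init__()
        self.hidden = nn.Conv2d(in_ch, hidden, kernel_size=5, dilation=2, padding=4)  # dilated 5x5
        self.out = nn.Conv2d(hidden, code_ch, kernel_size=5, padding=2)                # 5x5 output
        self.act = nn.ReLU()

    def forward(self, feat):
        """feat: (B, in_ch, H, W) deep features -> (B, code_ch, H, W) segmentation code."""
        return self.out(self.act(self.hidden(feat)))
```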
6. The video target tracking method for single-sample learning according to claim 1, characterized in that step S5 specifically comprises the following operation steps:
E1: based on the first frame image of the video sequence, computing the segmentation tracking module parameters for the initial input image Ima_1 together with the given ground-truth annotation label Yseg_1;
E2: the annotated image pair (x_1, Yseg_1) of the video sequence constitutes the training sample for learning to segment the given target;
E3: predicting the segmentation tracking module parameters of the previous step by directly minimizing the segmentation error in the first frame ensures robust segmentation prediction for the upcoming frames;
E4: using the first-frame ground-truth annotation Yseg_1 as the label in our single-sample learning information extraction module;
E5: performing code generation on the ground-truth annotation label Yseg_1, allowing the segmentation tracking module to predict a richer representation of the target segmentation in the test frames;
E6: guiding the single-sample learning information extraction module to optimize and converge as fast as possible, realizing real-time optimal segmentation code output from the segmentation tracking module.
7. The video target tracking method for single-sample learning according to claim 1, characterized in that step S6 specifically comprises the following operation steps:
F1: the input of the multi-channel target segmentation unit comprises three information streams: first, the real-time optimal segmentation code output by the segmentation tracking module; second, the output of the single-sample learning information extraction module; third, the output of the image frame sequence perception feature extraction module;
F2: first constructing the memory bank, whose input information comes from two branches: one is the output of the single-sample learning information extraction module, the other is the output of the image frame sequence perception feature extraction module;
F3: the memory bank stores semantic information with dynamic parameters configured for regional semantic comparison and pixel semantic aggregation, and consists of a set of target feature components;
F4: the input information stream of the target segmentation decoding module comprises two branches: one is the real-time optimal segmentation code output by the segmentation tracking module; the other is the embedded representation output by the memory bank;
F5: the target segmentation decoding module expresses all class feature representatives derived from the memory bank in the form of a real-valued tensor;
F6: further capturing the global context of preceding and following frame sequences across images, thereby enriching the representational power of the semantic understanding;
F7: the memory bank retains the compressed global feature representatives at the inference stage, and the video target tracking result is output through feature information fusion.
8. The video target tracking method for single-sample learning according to claim 1, characterized in that: combining the pre-trained vectorized representations from steps S1-S6, model parameters specific to the target object are learned during inference, the appearance information of the target object is generalized, and optimal dynamic target tracking is realized.
CN202211108906.6A 2022-09-13 2022-09-13 Video target tracking method for single sample learning Pending CN115393400A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211108906.6A CN115393400A (en) 2022-09-13 2022-09-13 Video target tracking method for single sample learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211108906.6A CN115393400A (en) 2022-09-13 2022-09-13 Video target tracking method for single sample learning

Publications (1)

Publication Number Publication Date
CN115393400A true CN115393400A (en) 2022-11-25

Family

ID=84125715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211108906.6A Pending CN115393400A (en) 2022-09-13 2022-09-13 Video target tracking method for single sample learning

Country Status (1)

Country Link
CN (1) CN115393400A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115965758A (en) * 2022-12-28 2023-04-14 无锡东如科技有限公司 Three-dimensional reconstruction method for image cooperation monocular instance

Similar Documents

Publication Publication Date Title
Zhao et al. Object detection with deep learning: A review
Fiaz et al. Handcrafted and deep trackers: Recent visual object tracking approaches and trends
Kim et al. Multi-object tracking with neural gating using bilinear lstm
CN110298404B (en) Target tracking method based on triple twin Hash network learning
Cui et al. Efficient human motion prediction using temporal convolutional generative adversarial network
CN112926396B (en) Action identification method based on double-current convolution attention
CN110781262B (en) Semantic map construction method based on visual SLAM
CN109443382A (en) Vision SLAM closed loop detection method based on feature extraction Yu dimensionality reduction neural network
CN111931602A (en) Multi-stream segmented network human body action identification method and system based on attention mechanism
Bilal et al. A transfer learning-based efficient spatiotemporal human action recognition framework for long and overlapping action classes
CN112884742A (en) Multi-algorithm fusion-based multi-target real-time detection, identification and tracking method
Hakim et al. Survey: Convolution neural networks in object detection
Mocanu et al. Single object tracking using offline trained deep regression networks
Saqib et al. Intelligent dynamic gesture recognition using CNN empowered by edit distance
Ning et al. Deep Spatial/temporal-level feature engineering for Tennis-based action recognition
CN116912804A (en) Efficient anchor-frame-free 3-D target detection and tracking method and model
CN115393400A (en) Video target tracking method for single sample learning
Couprie et al. Joint future semantic and instance segmentation prediction
Silva et al. Online weighted one-class ensemble for feature selection in background/foreground separation
Ben Mahjoub et al. An efficient end-to-end deep learning architecture for activity classification
Qiu et al. Using stacked sparse auto-encoder and superpixel CRF for long-term visual scene understanding of UGVs
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN110378938A (en) A kind of monotrack method based on residual error Recurrent networks
CN115100740A (en) Human body action recognition and intention understanding method, terminal device and storage medium
CN115034459A (en) Pedestrian trajectory time sequence prediction method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination