CN115393400A - Video target tracking method for single sample learning - Google Patents
- Publication number
- CN115393400A (application CN202211108906.6A)
- Authority
- CN
- China
- Prior art keywords
- segmentation
- target
- tracking
- image
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06N3/088—Non-supervised learning, e.g. competitive learning
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V20/40—Scenes; Scene-specific elements in video content
- G06T2207/10016—Video; Image sequence
Abstract
The invention discloses a video target tracking method for single-sample learning. The method constructs a video-image target-object semantic segmentation data set and defines an image semantic segmentation model; constructs an image-frame-sequence perception feature extraction module, a single-sample learning information extraction module, and a segmentation tracking module; and infers the image segmentation of subsequent frames. A multi-channel target segmentation unit takes the optimal segmentation code output by the segmentation tracking module as input, and the final output of the segmentation decoder is a multi-channel target semantic segmentation result. By combining pre-trained vectorized representations with target-specific model-parameter learning designed into the inference process, the method better generalizes the appearance information of the target object and achieves state-of-the-art dynamic target tracking, thereby enabling optimal autonomous action planning for an intelligent robot.
Description
Technical Field
The invention relates to the field of intelligent manufacturing and machine vision, in particular to a video target tracking method for single-sample learning.
Background
A video target tracking method for single-sample learning performs dynamic target tracking. In the intelligent-manufacturing production scene of a digital factory, an intelligent robot must dynamically track the moving target object it is to manipulate while executing a given production task and planning its actions autonomously, so as to achieve accurate real-time downstream control. Target-object tracking based on a video stream is therefore one of the key technologies for realizing such applications. Video target tracking can be achieved by semantically segmenting the target object frame by frame over the video sequence, thereby separating the target object's foreground from the background and enabling the intelligent robot's autonomous action planning.
Existing video target tracking methods for single-sample learning have certain shortcomings. Prior work adapts a semantic segmentation network to the video target tracking task through online fine-tuning; however, this easily overfits to the appearance of the target object as set in the first frame and incurs high latency from the excessive computational complexity. Follow-up methods integrate a target-specific appearance model into the segmentation model, improving running time and enabling end-to-end learning; however, to generalize to new target objects to be tracked, feature-matching techniques based on feature embeddings are often difficult to realize effectively through deep learning.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects of the prior art, the invention provides a video target tracking method for single-sample learning. The method combines pre-trained vectorized representations with target-specific model-parameter learning designed into the inference process, better generalizes the appearance information of the target object, and achieves state-of-the-art dynamic target tracking, thereby enabling optimal autonomous action planning for an intelligent robot and effectively solving the problems noted in the background art.
(II) technical scheme
In order to achieve the purpose, the invention adopts the technical scheme that: a video target tracking method for single sample learning comprises the following operation steps:
S1: model construction: construct a video-image target-object semantic segmentation data set and define an image semantic segmentation model;
S2: module construction: construct an image-frame-sequence perception feature extraction module;
S3: sample-module construction: construct the single-sample learning information extraction module;
S4: tracking-module construction: construct the segmentation tracking module;
S5: image segmentation reasoning: based on the first-frame image, the single-sample learning information extraction module and the segmentation tracking module infer the segmentation of subsequent frame images;
S6: result output: the multi-channel target segmentation unit takes the optimal segmentation code output by the segmentation tracking module as input, and the final output of the segmentation decoder is a multi-channel target semantic segmentation result.
As a preferred technical solution of the present application, the step S1 specifically includes the following steps:
A1: construct a semi-supervised image segmentation data set;
A2: in the image segmentation data set of the video-target-tracking time sequence, the target object is defined only by the reference foreground/background segmentation labels given in the first frame; for a specific video sequence, often only the semantic segmentation label of the first-frame image is given, after which the target in each subsequent frame must be segmented by inference;
A3: based on the above data-set characteristics, we define a video object segmentation framework whose learnable parameters are obtained through learning during model training;
A4: video target tracking is realized as an end-to-end network using a video image segmentation method.
As a preferred technical solution of the present application, the step S2 specifically includes the following steps:
B1: the image-frame-sequence perception feature extraction module comprises a pixel-correlation information aggregation unit, a pixel classifier, and an attention mechanism unit;
B2: construct the pixel classifier: its input is the single-sample true-value annotation label Yseg_1, and by encoding this input target true-value label it predicts the segmentation true-value labels of the other input images of the single-sample learning information extraction module;
B3: an inference procedure over the video sequence automatically labels the new image of the next frame, yielding feature/segmentation-label pairs (x_t, Yseg_t) for additional frames and thereby expanding the small-sample learning data set.
As a preferred technical solution of the present application, the step S3 specifically includes the following steps:
C1: solve by applying steepest-descent iterations, striking a compromise between precision and efficiency;
C2: all computations are performed with standard neural-network operations; starting from a given initialization, N steepest-descent iterations are executed. Because steepest descent converges quickly and efficiently, setting the iteration count N = 5 during training and inference achieves the expected convergence;
C3: setting the number of subsequent update iterations to N = 2 already yields a highly satisfactory update while minimizing computation, enabling real-time processing;
C4: the constructed single-sample learning information extraction module can be applied to the subsequently input test-frame sequence; combined with the real-time parameter updates of the previous step, it serves downstream processing as the predictor for the segmentation tracking module, and the segmentation code obtained for the video-tracked target is finally provided as input to the segmentation decoder.
As a preferred technical solution of the present application, the step S4 specifically includes the following steps:
D1: the segmentation tracking module operates on the deep feature map in real time with lightweight computation and predicts rich encoding information for the target segmentation;
D2: the model parameters of the segmentation tracking module are obtained via the single-sample learning information extraction module by minimizing the squared error between the segmentation tracking module's output and the generated true-value annotation label, weighted by the per-pixel importance weights provided by the attention mechanism unit;
D3: the segmentation tracking module is implemented as a convolution filter with kernel size K = 5; the first intermediate hidden layer uses dilated (hole) convolution with dilation rate D_r = 2, extracting mutual pixel information over a wider range at minimal computational cost.
As a preferred technical solution of the present application, the step S5 specifically includes the following operation steps:
E1: based on the first-frame image of the video sequence, the parameters of the segmentation tracking module are computed for the initial input image Ima_1 together with its given true-value annotation label Yseg_1;
E2: the annotated image pair (x_1, Yseg_1) of the video sequence constitutes the training sample for learning to segment the given target;
E3: predicting the segmentation-tracking-module parameters from the previous step by directly minimizing the segmentation error in the first frame ensures robust segmentation prediction for upcoming frames;
E4: the first-frame true-value annotation Yseg_1 serves as the label in our single-sample learning information extraction module;
E5: encoding the true-value label Yseg_1 allows the segmentation tracking module to predict a richer representation of the target segmentation in the test frames;
E6: this guides the single-sample learning information extraction module to optimize its learning and converge at the fastest rate, realizing real-time optimal segmentation-code output from the segmentation tracking module.
As a preferred technical solution of the present application, the step S6 specifically includes the following operation steps:
F1: the input of the multi-channel target segmentation unit comprises three information streams: first, the real-time optimal segmentation code output by the segmentation tracking module; second, the output of the single-sample learning information extraction module; third, the output of the image-frame-sequence perception feature extraction module;
F2: a memory bank is constructed first; its input information comes from two branches: first, the output of the single-sample learning information extraction module; second, the output of the image-frame-sequence perception feature extraction module;
F3: the memory bank stores semantic information through dynamically parameterized regional semantic comparison and pixel semantic aggregation, the stored semantic information being composed of per-target features;
F4: the input information stream of the target segmentation decoding module comprises two branches: first, the real-time optimal segmentation code output by the segmentation tracking module; second, the embedded representation output by the memory bank;
F5: the target segmentation decoding module expresses all class-feature representations derived from the memory bank in real-tensor form;
F6: further capturing the global context of preceding and following frame sequences between images enriches the representational power of semantic understanding;
F7: during the inference stage, the memory bank retains a compressed, globally representative feature representation and outputs the video target tracking result through feature-information fusion.
As a preferred technical solution of the present application, steps S1-S6 combine pre-trained vectorized representations with target-specific model-parameter learning in the inference process, generalizing the target object's appearance information and thereby achieving state-of-the-art dynamic target tracking.
(III) advantageous effects
Compared with the prior art, the invention provides a video target tracking method for single-sample learning with the following beneficial effects. The method combines pre-trained vectorized representations with target-specific model-parameter learning designed into the inference process, better generalizes the appearance information of the target object, and achieves state-of-the-art dynamic target tracking, thereby enabling optimal autonomous action planning for an intelligent robot. Video target tracking is the process of localizing a moving target object over a video time sequence, covering both single-target and multi-target tracking; related challenges include occlusion and motion blur, viewpoint and scale changes, background and illumination changes, and the target object leaving the field of view during tracking. Small-sample learning is a machine-learning approach that performs classification or regression with very little labeled data combined with a large amount of unlabeled data; mainstream deep-learning methods, by contrast, can train a model to acceptable generalization performance only with large labeled data sets. Small-sample learning thus effectively addresses the lack of labeled data in real production scenes, markedly reduces the labor and time cost of data annotation, and markedly improves the robustness and generalization of the algorithmic system.
Compared with other existing single-sample learning methods, the target information extracted in the training and learning stage is richer: key association information between consecutive frames of the video sequence and contextual semantic information between near and far pixels of the image are memorized effectively, improving the accuracy of multi-target construction. Furthermore, the computation time of this single-sample learning method in the model inference stage is much lower than that of other existing models, improving the real-time performance of algorithmic inference.
1. The method successfully realizes single-sample-learning video target tracking, with performance superior to other existing comparable methods. The segmentation tracking module learns to predict an initial segmentation true-value annotation label for the target object from the first frame. This true-value label is refined by the network, so the network constructed by the invention has a strong capability of learning segmentation priors; moreover, it is not limited to operating on approximate target annotation labels when performing conditional segmentation of the target object. With Yseg_1 as the label in the single-sample learning information extraction module, a trainable pixel classifier is introduced and realized through automatic machine learning inside the module, rather than simply using the first-frame true-value annotation directly. The segmentation tracking module can thus predict multi-channel segmentation, providing strongly associated target-aware information and making segmentation prediction more accurate.
2. STM realizes video object segmentation using a space-time memory network, providing a novel solution for semi-supervised video object segmentation that turns video frames with object masks into richer intermediate predictions. FEELVOS proposes fast end-to-end embedding learning for video object segmentation. PreMVOS first generates an accurate set of object-segmentation mask proposals for each video frame, then selects and merges these proposals into accurate, temporally consistent pixel-level object tracks across the video sequence, specifically addressing the segmentation of multiple objects. FRTM proposes a new VOS architecture consisting of two network components: a target appearance model built from a lightweight module that learns in the inference phase using fast optimization techniques and predicts a coarse but robust object segmentation. SiamRCNN, combined with a dynamic-programming-based algorithm, models the complete history of the object to be tracked and of potential distractor objects using re-detection of the first-frame template and the previous frame's prediction. AGSSVOS segments all object instances simultaneously in one feed-forward path via instance-agnostic and instance-specific modules, with information from both modules fused through an attention-guided fusion decoder. Compared with these methods, the present method achieves the best performance: the average Jaccard score improves by 2.9 and the boundary score by 2.5 over the existing best methods, a 3.4% improvement over the optimal method STM, and the test results prove that the new method's performance in video target tracking is optimal.
3. In the algorithmic inference process, given a test sequence together with the first-frame label, an initial training set of single-sample pairs is first created for the single-sample learning information extraction module from a feature map extracted from the first frame. The single-sample learning information extraction module then predicts the parameters of the segmentation tracking module through iterative learning; the segmentation tracking module's initial estimate is set to all zeros to simplify computation and improve the system's real-time performance. The learned model is then applied to the subsequent test frame Ima_2 to obtain its label encoding. To adapt to the scene, the segmentation tracking module is further updated with frame information using a memory-bank approach, and the memory is kept current by deleting stale samples while always retaining the first frame. If a video sequence contains multiple targets, each target is processed independently in parallel; the re-detected first-frame template and the previous frame's prediction are used to model the complete tracking history of the objects and of potential distractors, yielding optimal tracking decisions and optimized re-detection of target objects after long occlusions. Experiments verify that the algorithm's performance is optimal.
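The inference procedure just described (zero initialization, first-frame fitting, then per-frame prediction with online updates) can be outlined as below. `fit`, `update`, and `decode` are hypothetical callables standing in for the single-sample learning information extraction module, its online update, and the segmentation decoder; the step counts mirror the patent's N = 5 initial and N = 2 update iterations:

```python
def track(frames, first_mask, fit, update, decode, num_init=5, num_update=2):
    """Illustrative inference loop: parameters start at zero, are fitted on
    the annotated first frame with num_init optimization steps, and refined
    with num_update steps as each new pseudo-labelled frame arrives."""
    theta = 0                                          # all-zero initial estimate
    theta = fit(theta, frames[0], first_mask, num_init)
    masks = [first_mask]
    for frame in frames[1:]:
        mask = decode(theta, frame)                    # segmentation code -> mask
        theta = update(theta, frame, mask, num_update) # online refinement
        masks.append(mask)
    return masks
```

For multiple targets, this loop would simply be run once per target in parallel, as the description notes.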
The whole video target tracking method for single-sample learning is simple and convenient to operate, and the using effect is better than that of a traditional mode.
Drawings
Fig. 1 is a schematic diagram of an overall algorithm flow of a single-sample learning video target tracking method according to the present invention.
Detailed Description
The technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings and the detailed description, but those skilled in the art will understand that the following described embodiments are some, not all, of the embodiments of the present invention, and are only used for illustrating the present invention, and should not be construed as limiting the scope of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention. The examples, in which specific conditions are not specified, were conducted under conventional conditions or conditions recommended by the manufacturer. The instruments used are not indicated by the manufacturer, and are all conventional products available by commercial purchase.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
As shown in fig. 1, a video target tracking method for single-sample learning comprises the following operation steps:
S1: model construction: construct a video-image target-object semantic segmentation data set and define an image semantic segmentation model;
S2: module construction: construct an image-frame-sequence perception feature extraction module;
S3: sample-module construction: construct the single-sample learning information extraction module;
S4: tracking-module construction: construct the segmentation tracking module;
S5: image segmentation reasoning: based on the first-frame image, the single-sample learning information extraction module and the segmentation tracking module infer the segmentation of subsequent frame images;
S6: result output: the multi-channel target segmentation unit takes the optimal segmentation code output by the segmentation tracking module as input, and the final output of the segmentation decoder is a multi-channel target semantic segmentation result.
Further, the step S1 specifically includes the following steps:
A1: construct a semi-supervised image segmentation data set;
A2: in the image segmentation data set of the video-target-tracking time sequence, the target object is defined only by the reference foreground/background segmentation labels given in the first frame; for a specific video sequence, often only the semantic segmentation label of the first-frame image is given, after which the target in each subsequent frame must be segmented by inference;
A3: based on the above data-set characteristics, we define a video object segmentation framework whose learnable parameters are obtained through learning during model training;
A4: video target tracking is realized as an end-to-end network using a video image segmentation method.
Further, the step S2 specifically includes the following operation steps:
B1: the image-frame-sequence perception feature extraction module comprises a pixel-correlation information aggregation unit, a pixel classifier, and an attention mechanism unit;
B2: construct the pixel classifier: its input is the single-sample true-value annotation label Yseg_1, and by encoding this input target true-value label it predicts the segmentation true-value labels of the other input images of the single-sample learning information extraction module;
B3: an inference procedure over the video sequence automatically labels the new image of the next frame, yielding feature/segmentation-label pairs (x_t, Yseg_t) for additional frames and thereby expanding the small-sample learning data set.
Further, the step S3 specifically includes the following operation steps:
C1: solve by applying steepest-descent iterations, striking a compromise between precision and efficiency;
C2: all computations are performed with standard neural-network operations; starting from a given initialization, N steepest-descent iterations are executed. Because steepest descent converges quickly and efficiently, setting the iteration count N = 5 during training and inference achieves the expected convergence;
C3: setting the number of subsequent update iterations to N = 2 already yields a highly satisfactory update while minimizing computation, enabling real-time processing;
C4: the constructed single-sample learning information extraction module can be applied to the subsequently input test-frame sequence; combined with the real-time parameter updates of the previous step, it serves downstream processing as the predictor for the segmentation tracking module, and the segmentation code obtained for the video-tracked target is finally provided as input to the segmentation decoder.
Further, the step S4 specifically includes the following operation steps:
D1: the segmentation tracking module operates on the deep feature map in real time with lightweight computation and predicts rich encoding information for the target segmentation;
D2: the model parameters of the segmentation tracking module are obtained via the single-sample learning information extraction module by minimizing the squared error between the segmentation tracking module's output and the generated true-value annotation label, weighted by the per-pixel importance weights provided by the attention mechanism unit;
D3: the segmentation tracking module is implemented as a convolution filter with kernel size K = 5; the first intermediate hidden layer uses dilated (hole) convolution with dilation rate D_r = 2, extracting mutual pixel information over a wider range at minimal computational cost.
Further, the step S5 specifically includes the following steps:
E1: based on the first-frame image of the video sequence, the parameters of the segmentation tracking module are computed for the initial input image Ima_1 together with its given true-value annotation label Yseg_1;
E2: the annotated image pair (x_1, Yseg_1) of the video sequence constitutes the training sample for learning to segment the given target;
E3: predicting the segmentation-tracking-module parameters from the previous step by directly minimizing the segmentation error in the first frame ensures robust segmentation prediction for upcoming frames;
E4: the first-frame true-value annotation Yseg_1 serves as the label in our single-sample learning information extraction module;
E5: encoding the true-value label Yseg_1 allows the segmentation tracking module to predict a richer representation of the target segmentation in the test frames;
E6: this guides the single-sample learning information extraction module to optimize its learning and converge at the fastest rate, realizing real-time optimal segmentation-code output from the segmentation tracking module.
Further, the step S6 specifically includes the following steps:
F1: the input of the multi-channel target segmentation unit comprises three information streams: first, the real-time optimal segmentation code output by the segmentation tracking module; second, the output of the single-sample learning information extraction module; third, the output of the image-frame-sequence perception feature extraction module;
F2: a memory bank is constructed first; its input information comes from two branches: first, the output of the single-sample learning information extraction module; second, the output of the image-frame-sequence perception feature extraction module;
F3: the memory bank stores semantic information through dynamically parameterized regional semantic comparison and pixel semantic aggregation, the stored semantic information being composed of per-target features;
F4: the input information stream of the target segmentation decoding module comprises two branches: first, the real-time optimal segmentation code output by the segmentation tracking module; second, the embedded representation output by the memory bank;
F5: the target segmentation decoding module expresses all class-feature representations derived from the memory bank in real-tensor form;
F6: further capturing the global context of preceding and following frame sequences between images enriches the representational power of semantic understanding;
F7: during the inference stage, the memory bank retains a compressed, globally representative feature representation and outputs the video target tracking result through feature-information fusion.
Furthermore, the steps S1-S6 combine pre-trained vectorized representations and design model parameters specific to the target object during inference to learn and generalize the appearance information of the target object, thereby realizing state-of-the-art dynamic target tracking.
In a comparative performance analysis against other state-of-the-art methods of the same kind (see the table below), the experimental data set adopts an intelligent-manufacturing production-scene video target tracking data set constructed by the applicants, comprising 50 video segments. The evaluation indices include the mean region similarity (Jaccard) score, the boundary score, and the overall score; the compared algorithms include STM, FEELVOS, PReMVOS, FRTM, SiamRCNN, and AGSS-VOS.
The embodiment is as follows:
The overall method comprises the following implementation steps:
1. Construction of the semantic segmentation data set of the video image target object and definition of the image semantic segmentation model.
2. Construction of the image frame sequence perception feature extraction module.
3. Construction of the single-sample learning information extraction module.
4. Construction of the segmentation tracking module.
5. Segmentation inference on subsequent frame images based on the first-frame single-sample learning information extraction module and the segmentation tracking module.
6. The multi-channel target segmentation unit takes the optimal segmentation code output by the segmentation tracking module as input; the final output of the segmentation decoder is the multi-channel target semantic segmentation result.
General step 1: construction of the semantic segmentation data set of the video image target object and definition of the image semantic segmentation model.
Step one: constructing the semi-supervised image segmentation data set. A typical video data set consists of a series of multi-frame images in chronological order.
Each video in the data set is represented by its frame-time-series images, and a set of annotated segmentation labels corresponds to a subset of those image frames, distinguishing foreground objects from the image background. The video image segmentation data set therefore comprises a set of "image-label" data pairs together with the remaining frames of unsupervised (unlabeled) image data.
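As an illustrative sketch (the class and field names here are assumptions, not from the patent), the semi-supervised data-set structure just described — one labeled first frame plus unlabeled subsequent frames — can be modeled as:

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class Frame:
    image: Any                  # frame pixels (e.g. an array)
    label: Optional[Any] = None # segmentation mask, or None if unlabeled

@dataclass
class VideoDataset:
    frames: List[Frame] = field(default_factory=list)

    def labeled_pairs(self):
        """The supervised 'image-label' pairs (only the first frame in one-shot VOS)."""
        return [(f.image, f.label) for f in self.frames if f.label is not None]

    def unlabeled(self):
        """The remaining unsupervised image frames."""
        return [f.image for f in self.frames if f.label is None]
```

In the one-shot setting of this patent, `labeled_pairs()` would contain exactly the single first-frame pair, and every subsequent frame appears in `unlabeled()`.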
Step two: in the image segmentation data set of the video target tracking time series, the target object is defined only by the reference foreground/background segmentation label given in the first frame. This process belongs to few-shot learning: a semantic segmentation label is given only for the first frame image of a specific video sequence, after which the segmentation target must be inferred in each subsequent frame. The video sequence data set is represented in chronological order,
where the last element represents the latest frame image at the current time. In this application scenario only this one first image frame is annotated data with ground-truth values; all other frames belong to unsupervised data. The final representation of the data set of the invention is therefore a single labeled first frame together with the subsequent unsupervised image frames.
Step three: based on the above data-set characteristics, the video target segmentation framework is defined with learnable parameters that are obtained through learning during the model training process. The network takes as input the current image Ima and the output of the segmentation tracking module. Although the network itself is not specific to the target, it is conditioned on the segmentation tracking module, which integrates information about the target object encoded into its parameters: a convolution kernel forming the weights of a convolutional layer with kernel size K, thereby achieving the goal of integrating information about the target object into the parameters.
Step four: video target tracking is realized as an end-to-end network using the video image segmentation method. As shown in Figure 1, it is composed of the image frame sequence perception feature extraction module, the single-sample learning information extraction module, the segmentation tracking module, and the multi-channel target segmentation unit. Part of the parameters are network parameters learned during offline training, while the remainder are parameters predicted by the single-sample learning information extraction module during inference.
General step 2: construction of the image frame sequence perception feature extraction module.
Step one: the image frame sequence perception feature extraction module comprises a pixel-associated-information gathering unit, a pixel classifier, and an attention mechanism unit. First, the pixel-associated-information gathering unit is constructed: NASN is adopted as its backbone network, where the network structure of NASN is obtained using a neural-architecture-search framework. The time-series image frame Ima is input to the unit, which outputs the pixel-associated-information gathering features. These features are then output in three directions: first, to the memory bank of the multi-channel target segmentation unit; second, to the segmentation tracking module; third, to the single-sample learning information extraction module (see Figure 1; details in subsequent steps). The multi-channel target segmentation unit uses three residual blocks with spatial stride s = 16; these features are first processed by an additional convolutional layer. Before being input to the segmentation tracking module, the channel dimension is reduced to C = 512.
Step two: constructing the pixel classifier. The input of the pixel classifier is the single-sample ground-truth label Yseg1; by encoding the input target ground-truth label Yseg1, it predicts the segmentation ground-truth labels of the other input images for the single-sample learning information extraction module. The pixel classifier maps the input tensor to a depth feature map whose height, width, and dimension are H, W, and D — the height, width, and dimension of the segmentation tracking module features — with feature stride s. The pixel classifier is implemented as a dilated (atrous) convolutional network.
Step three: the attention mechanism unit takes the target ground-truth label Yseg as input. Mapping-tensor element values are generated according to a convergence function over a set of Q feature-segmentation label pairs (x_t, Yseg_t), which includes the single ground-truth-labeled frame (x_1, Yseg_1). By adopting an inference method over the video sequence, the label of the next frame image is annotated automatically, yielding feature-segmentation label pairs (x_t, Yseg_t) for additional frames and thereby expanding the few-shot learning data set. The scalar λ is a regularization parameter obtained by learning. The attention mechanism unit is implemented as a convolutional network with 6 hidden layers, and shares the feature-mapping tensor structure unit with the pixel-associated-information gathering unit and the pixel classifier.
General step 3: construction of the single-sample learning information extraction module.
Step one: the constructed single-sample learning information extraction module is continuously differentiable; it minimizes a squared-error objective that is a continuously differentiable function of the segmentation tracking module parameters. Steepest-descent iteration is applied to solve it, obtaining a compromise between precision and efficiency. The optimization iteration updates, for the t-th frame image, the parameter value of the i-th iteration to the value of the (i+1)-th iteration by stepping along the negative gradient of the objective (the computation involves gradient, transpose, and convolution operations), where x_t denotes the pixel-associated-information gathering feature of the t-th frame image and Yseg_t denotes the segmentation label of the t-th frame. The input of the single-sample learning information extraction module is the feature set; its output forms the weights of a convolutional layer with kernel size K, integrating the information about the target object, and constitutes the parameters predicted by the single-sample learning information extraction module during inference.
Step two: all computations of the optimization iterative method are realized with standard neural network operations. Being continuously differentiable, the segmentation-tracking-module parameters obtained after i iterations are also differentiable with respect to all network parameters. The single-sample learning information extraction module is realized as a network module: as mentioned above, it operates on a set of Q feature-segmentation label pairs (x_t, Yseg_t), executing N steepest-descent iterations from a given initialization (the method is as described in step one). Since steepest descent converges quickly and efficiently, setting the number of iterations to N = 5 during training and inference achieves the expected convergence.
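The steepest-descent inner loop can be sketched, under stated assumptions, as an importance-weighted regularized least-squares fit of a linear target model. The function and variable names (`steepest_descent_fit`, `tau`, `v`, `lam`) are illustrative, and the real module operates on convolutional feature maps rather than a flat feature matrix; this is a sketch of the optimization technique, not the patent's exact formulation:

```python
import numpy as np

def steepest_descent_fit(x, y, v, lam=0.05, n_iter=5, tau0=None):
    """Fit a linear filter tau minimizing
        0.5 * sum(v * (x @ tau - y)**2) + 0.5 * lam * ||tau||^2
    via steepest descent with an exact line search (the objective is quadratic).
    x: (P, C) per-pixel features; y: (P,) labels; v: (P,) importance weights."""
    tau = np.zeros(x.shape[1]) if tau0 is None else tau0.copy()
    for _ in range(n_iter):
        r = x @ tau - y                       # residual
        g = x.T @ (v * r) + lam * tau         # gradient of the objective
        hg = x.T @ (v * (x @ g)) + lam * g    # Hessian-vector product H @ g
        denom = g @ hg
        if denom <= 1e-12:                    # gradient vanished: converged
            break
        tau = tau - (g @ g) / denom * g       # exact step length g.g / g.H.g
    return tau
```

The online update of step three below would call the same routine with `tau0` set to the previously predicted parameters and `n_iter=2`, after appending the new frame's features and inferred labels to the data set.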
Step three: on the basis of the parameters obtained in the previous step, the optimization-based formulation is further used to update the segmentation tracking module parameters in time, combining newly input subsequent-frame samples: each newly input frame image is added to the data set, and the iterative optimization is applied again. Setting the number of new optimization iterations to N = 2 achieves a very satisfactory updating effect while reducing the computation to a minimum, so that real-time processing can be realized.
Step four: the segmentation tracking module parameters are first obtained by steepest-descent iterative prediction on the first frame. The single-sample learning information extraction module constructed in this way can then be applied to the test frame sequence input subsequently in chronological order, combined with the real-time parameter updating of the previous step, and applied to downstream processing links as the predictive segmentation tracking module. The segmentation code used to obtain the video tracking target is finally provided as input to the segmentation decoder.
General step 4: construction of the segmentation tracking module.
Step one: the constructed segmentation tracking module is implemented as a mapping whose parameters are the output of the single-sample learning information extraction module. Through training, the structure of the segmentation tracking module maps the input C-dimensional depth features x to a D-dimensional target-segmentation code with the same spatial resolution H × W. To ensure that the mapping is continuously differentiable, the segmentation tracking module is constructed as a convolution whose kernel has size K. The depth feature mapping of the segmentation tracking module supports real-time lightweight operation and can predict rich coding information of the target segmentation.
Step two: the model parameters in the segmentation tracking module are obtained by the single-sample learning information extraction module, by minimizing the squared error between the output of the segmentation tracking module and the generated ground-truth label, weighted by the per-pixel importance weights provided by the attention mechanism unit.
Step three: the segmentation tracking module is implemented as a convolution filter with kernel size K = 5; the first intermediate hidden layer adopts dilated (atrous) convolution with dilation rate Dr = 2, so that mutual information over a wider range of pixels can be extracted with minimal computation. The number of output channels D is set to 32.
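To illustrate why the dilation rate widens the range of extracted pixel information at no extra cost in weights, a minimal single-channel dilated convolution can be sketched. This is an assumption-laden simplification: the actual module is a learned multi-channel filter with K = 5, Dr = 2, and D = 32 output channels:

```python
import numpy as np

def dilated_conv2d(x, w, dilation=2):
    """'Same'-size 2D dilated convolution (single channel, stride 1, zero padding).
    With a K x K kernel and dilation r, the receptive field spans (K-1)*r + 1 pixels."""
    k = w.shape[0]
    eff = (k - 1) * dilation + 1          # effective kernel extent
    pad = eff // 2
    xp = np.pad(x, pad)                   # zero-pad so output keeps input size
    out = np.zeros_like(x, dtype=float)
    h, wd = x.shape
    for i in range(h):
        for j in range(wd):
            # sample the padded image at dilated offsets and correlate with w
            patch = xp[i:i + eff:dilation, j:j + eff:dilation]
            out[i, j] = np.sum(patch * w)
    return out
```

With K = 5 and dilation 2, the effective receptive field per output pixel is (5 − 1) × 2 + 1 = 9 pixels in each direction, while only 25 weights are stored.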
General step 5: segmentation inference on subsequent frame images based on the first-frame single-sample learning information extraction module and the segmentation tracking module.
Step one: based on the first frame image of the video sequence, the segmentation tracking module parameters are computed for the initial input image Ima1 in combination with the given ground-truth label Yseg1, where the ground-truth label Yseg1 defines the feature information of the semantic segmentation of the target object; that is, the parameters are obtained by computation in the single-sample learning information extraction module.
Step two: the image-annotation pairs (x1, Yseg1) of the video sequence constitute training samples for learning to segment the given target. However, these training samples are only available at the inference stage; the task therefore belongs to the few-shot learning problem in video object segmentation, and since only a single first-frame ground-truth sample is actually available, it is strictly a single-sample (one-shot) learning problem. The single-sample learning information extraction module generates the parameters required by the segmentation tracking module from the single ground-truth-labeled pair (x1, Yseg1); this is realized by minimizing a supervised learning objective.
Step three: the pixel-associated-information gathering unit performs the deep embedded feature extraction for the input image, where the backbone network employs a DenseNet architecture. The segmentation tracking module learns the segmentation of the output target object in the initial frame. Given a new frame Ima during inference, the segmentation tracking module is applied to the new frame Ima to generate a first new segmentation inference. Due to the strong correlation of successive video frames, robust prediction of the segmentation of upcoming frames can be ensured by predicting the segmentation tracking module parameters from the previous step, directly minimizing the segmentation error in the first frame.
Step four: the first-frame ground-truth label Yseg1 is used as the label in our single-sample learning information extraction module, in conjunction with the subsequent unsupervised image data. A trainable label generator is introduced, which takes the ground-truth segmentation label Yseg as input and predicts the labels of the data set (i.e. the image sequence) used for learning and training by the single-sample learning information extraction module; the labels for the module are thus inferred predictions based on those inputs, and the segmentation tracking module parameters are predicted accordingly.
Step five: encoding the ground-truth label Yseg1 allows the segmentation tracking module to predict a richer representation of the target segmentation in test frames.
Step six: to address the imbalance of the training data set, a higher weight is set for the target region and a lower weight for blurred field-of-view regions. The attention mechanism unit adjusts the loss value: using the ground-truth label Yseg as input, it predicts the importance weight of each pixel in the loss. This guides the single-sample learning information extraction module to optimize and converge at the fastest speed, realizing real-time optimal segmentation-code output of the segmentation tracking module.
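The per-pixel importance weighting described above can be sketched as follows. The specific weight values (`w_fg`, `w_bg`, `w_blur`) are illustrative assumptions — in the patent the attention unit predicts these weights rather than fixing them:

```python
import numpy as np

def importance_weights(mask, w_fg=2.0, w_bg=1.0, w_blur=0.5, blur=None):
    """Per-pixel loss weights: higher on the target region, lower on blurred areas.
    mask: binary ground-truth target mask; blur: optional binary blurred-region mask."""
    v = np.where(mask > 0.5, w_fg, w_bg).astype(float)
    if blur is not None:
        v = np.where(blur > 0.5, w_blur, v)   # blurred regions are down-weighted
    return v

def weighted_sq_error(pred, target, v):
    """Importance-weighted squared segmentation error (the loss being minimized)."""
    return 0.5 * np.sum(v * (pred - target) ** 2)
```

A learned attention unit would replace `importance_weights` with a network mapping the label to a continuous weight map, but the role of the weights in the loss is the same.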
General step 6: the multi-channel target segmentation unit takes the optimal segmentation code output by the segmentation tracking module as input; the final output of the target segmentation unit is the multi-channel target semantic segmentation result.
Step one: the input of the multi-channel target segmentation unit includes three information streams: first, the real-time optimal segmentation code output by the segmentation tracking module; second, the output of the single-sample learning information extraction module; third, the output of the image frame sequence perception feature extraction module. The multi-channel target segmentation unit comprises two parts, a target segmentation decoding module and a memory bank; the memory bank construction process is introduced in steps two to three, and the target segmentation decoding module structure in steps four to seven.
Step two: first, the memory bank is constructed; its input information comes from two branches: first, the output of the single-sample learning information extraction module; second, the output of the image frame sequence perception feature extraction module. These information streams contain the pixel-level shallow and deep semantic features of the input image. The role of the memory bank is as follows: the contextual semantics between target-object components within an image and between preceding and following frame sequences are essential for understanding the relevance between pixels. The memory bank realizes pixel semantic aggregation by selectively memorizing the feature information of the input image sequence; semantic understanding is enhanced for semantic segmentation by using the context information of the time-series image frames in the memory bank, and a large number of regional semantic segmentation references are provided for constructing the memory bank.
Step three: the memory bank stores semantic information through dynamic parameters set for regional semantic comparison and pixel semantic aggregation, and is composed of target features, one per category. Each entry represents the global region-aware representation of the corresponding class over all observed images Ima in the whole learning phase, as a real-valued vector. At each training stage, the memory bank is updated with the new feature-learning results. Specifically, the current depth information of the image Ima is smoothly merged into the memory representation, where ε is a hyper-parameter of the memory bank with value range 0 < ε < 0.06; the entry for a category is updated when that category appears in the image Ima and its classification confidence is greater than a threshold.
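A minimal sketch of the memory bank update, assuming the "smooth update" is a convex (EMA-style) combination with coefficient ε — an assumption consistent with the stated range 0 < ε < 0.06, since the exact rule is given by one of the patent's elided formulas. The class names and threshold value are illustrative:

```python
import numpy as np

class MemoryBank:
    """Per-class global feature memory with smooth (EMA-style) updates."""
    def __init__(self, n_classes, dim, eps=0.05, conf_thresh=0.9):
        assert 0 < eps < 0.06            # hyper-parameter range from the text
        self.mem = np.zeros((n_classes, dim))
        self.eps = eps
        self.conf_thresh = conf_thresh

    def update(self, cls, feat, conf):
        """Smoothly merge a new class feature into the memory when the class
        is present with classification confidence above the threshold."""
        if conf > self.conf_thresh:
            self.mem[cls] = (1 - self.eps) * self.mem[cls] + self.eps * feat
```

The small ε keeps each entry a slowly changing, globally representative summary of its class rather than a copy of the latest frame.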
Step four: next, the structure of the target segmentation decoding module is introduced (steps four to seven). The input information stream of the target segmentation decoding module comprises two branches: first, the real-time optimal segmentation code output by the segmentation tracking module; second, the embedded representation output by the memory bank. The target segmentation decoding module compresses the embedded representations from the memory bank into a compact set of representative sample embedded representations. For each category entry, all of its features are passed through a 7-layer convolutional network for multi-classification to obtain a matrix of feature-representation vectors; multiple feature representations are used for each class to account for feature variations within a class.
Step five: the target segmentation decoding module expresses all class feature representatives derived from the memory bank in real-valued tensor form. For each feature location, where W and H denote the width and height of the image respectively, its depth semantics are computed from the embedded representation: the tensors are respectively reshaped and flattened into matrix form for high-speed computation, a transposed matrix multiplication is performed, and a hyperbolic tangent (tanh) function is applied, so that each entry of the result reflects a canonical generalized distance measure between each row (i.e. feature) of one matrix and each column (i.e. feature representative) of the other. Based on the depth semantics and the feature embedding, the feature representation is computed and its tensor structure adjusted back.
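A hedged sketch of this memory read-out: flattened pixel embeddings are matched against class representatives with a matrix product passed through tanh, and the result is used to aggregate representatives per pixel. The normalization step is an assumption on my part — the patent's elided formulas define the exact computation:

```python
import numpy as np

def memory_read(feats, reps):
    """Attend pixel features over memory representatives.
    feats: (N, D) flattened per-pixel embeddings; reps: (D, M) class representatives.
    Returns the (N, M) tanh similarity map and the (N, D) read-out features."""
    sim = np.tanh(feats @ reps)              # generalized similarity in [-1, 1]
    attn = sim / (np.abs(sim).sum(axis=1, keepdims=True) + 1e-8)  # assumed normalization
    read = attn @ reps.T                     # aggregate representatives per pixel
    return sim, read

def fuse(feats, read):
    """Concatenate the read-out with the original features (step six)."""
    return np.concatenate([feats, read], axis=1)
```

Reshaping a H × W × D feature map to (H*W, D) before `memory_read` and back afterwards corresponds to the "flatten for high-speed computation, then adjust the tensor structure" described in the text.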
Step six: the result is then concatenated with the original features. The benefit lies not only in encoding local context semantics within the image; beyond that, capturing the global context of preceding and following frame sequences between images enriches the representability of semantic understanding.
Step seven: the multi-channel target-pixel classification output by the target segmentation decoding module uses DenseNet as the backbone network to map the input image Ima into a convolutional representation; classification is realized with a convolutional neural network of hidden-layer depth 9, a class-aware attention map is generated from the feature embedding, and 5 × 5 convolution kernels are adopted. The memory bank stores all region patterns in the training data. In the inference stage, the memory bank retains a compressed, representative global feature representation and outputs the video target tracking result through feature information fusion.
It is noted that, herein, relational terms such as first and second (a, b, etc.) may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a/an..." does not exclude the presence of additional like elements in the process, method, article, or apparatus that comprises the element.
The foregoing shows and describes the general principles, features, and advantages of the present invention. It will be understood by those skilled in the art that the invention is not limited to the embodiments described above, which are set forth merely to illustrate its principles; various changes and modifications may be made without departing from the spirit and scope of the invention, all of which fall within the scope of the invention as claimed.
Claims (8)
1. A video target tracking method for single-sample learning, characterized by comprising the following operation steps:
S1: model construction: constructing a semantic segmentation data set of the video image target object and defining an image semantic segmentation model;
S2: module construction: constructing the image frame sequence perception feature extraction module;
S3: sample module construction: construction of the single-sample learning information extraction module;
S4: tracking module construction: construction of the segmentation tracking module;
S5: image segmentation inference: segmentation inference on subsequent frame images based on the first-frame single-sample learning information extraction module and the segmentation tracking module;
S6: multi-channel target segmentation: the multi-channel target segmentation unit takes the optimal segmentation code output by the segmentation tracking module as input and outputs the multi-channel target semantic segmentation result.
2. The video target tracking method for single-sample learning according to claim 1, characterized in that the step S1 specifically comprises the following operation steps:
a1: constructing a semi-supervised image segmentation data set;
a2: in the image segmentation data set of the video target tracking time series, the target object is defined only by the reference foreground/background segmentation label given in the first frame; for a specific video sequence, often only the semantic segmentation label of the first frame image is given, after which the segmentation target must be inferred in each subsequent frame;
a3: based on the above data-set characteristics, we define a video target segmentation framework whose learnable parameters are obtained through learning during the model training process;
a4: the video target tracking is realized as an end-to-end network by adopting a video image segmentation method.
3. The video target tracking method for single-sample learning according to claim 1, characterized in that the step S2 specifically comprises the following operation steps:
b1: the image frame sequence perception feature extraction module comprises a pixel-associated-information gathering unit, a pixel classifier, and an attention mechanism unit;
b2: constructing the pixel classifier: its input is the single-sample ground-truth label Yseg1; by encoding the input target ground-truth label Yseg1, the segmentation ground-truth labels of the other input images of the single-sample learning information extraction module are predicted;
b3: by adopting an inference method over the video sequence, the label of the next frame image is annotated automatically, yielding feature-segmentation label pairs (x_t, Yseg_t) for additional frames and thereby expanding the few-shot learning data set.
4. The video target tracking method for single-sample learning according to claim 1, characterized in that the step S3 specifically comprises the following operation steps:
c1: solving is performed by applying steepest-descent iteration, obtaining a compromise between precision and efficiency;
c2: all computations are realized with standard neural network operations; N steepest-descent iterations are executed from a given initialization, and since steepest descent converges quickly and efficiently, setting the iteration number N = 5 during training and inference achieves the expected convergence;
c3: setting the number of new optimization iterations to N = 2 achieves a very satisfactory updating effect and reduces the computation to a minimum, so that real-time processing can be realized;
c4: the constructed single-sample learning information extraction module can be applied to the test frame sequence input subsequently in chronological order, combined with the real-time parameter updating of the previous step, and applied to downstream processing links as the predictive segmentation tracking module; the segmentation code used to obtain the video tracking target is finally provided as input to the segmentation decoder.
5. The video target tracking method for single-sample learning according to claim 1, characterized in that the step S4 specifically comprises the following operation steps:
d1: the depth feature mapping of the segmentation tracking module supports real-time lightweight operation and can predict rich coding information of the target segmentation;
d2: the model parameters in the segmentation tracking module are obtained by the single-sample learning information extraction module, by minimizing the squared error between the output of the segmentation tracking module and the generated ground-truth label, weighted by the per-pixel importance weights provided by the attention mechanism unit;
d3: the segmentation tracking module is implemented as a convolution filter with kernel size K = 5; the first intermediate hidden layer adopts dilated (atrous) convolution with dilation rate Dr = 2, so that mutual information over a wider range of pixels can be extracted with minimal computation.
6. The video target tracking method for single-sample learning according to claim 1, characterized in that the step S5 specifically comprises the following operation steps:
e1: based on the first frame image of the video sequence, the segmentation tracking module parameters are computed for the initial input image Ima1 in combination with the given ground-truth label Yseg1;
e2: the image-annotation pairs (x1, Yseg1) of the video sequence constitute training samples for learning to segment the given target;
e3: predicting the segmentation tracking module parameters from the previous step by directly minimizing the segmentation error in the first frame ensures robust segmentation prediction for upcoming frames;
e4: the first-frame ground-truth label Yseg1 is used as the label in our single-sample learning information extraction module;
e5: encoding the ground-truth label Yseg1 allows the segmentation tracking module to predict a richer representation of the target segmentation in test frames;
e6: this guides the single-sample learning information extraction module to optimize and converge at the fastest speed, realizing real-time optimal segmentation-code output from the segmentation tracking module.
7. The video target tracking method for single-sample learning according to claim 1, characterized in that the step S6 specifically comprises the following operation steps:
f1: the input of the multi-channel target segmentation unit comprises three information streams: first, the real-time optimal segmentation code output by the segmentation tracking module; second, the output of the single-sample learning information extraction module; third, the output of the image frame sequence perception feature extraction module;
f2: a memory bank is first constructed, whose input information comes from two branches: first, the output of the single-sample learning information extraction module; second, the output of the image frame sequence perception feature extraction module;
f3: the memory bank stores semantic information through dynamic parameters set for regional semantic comparison and pixel semantic aggregation, and is composed of target features, one per category;
f4: the input information stream of the target segmentation decoding module comprises two branches: first, the real-time optimal segmentation code output by the segmentation tracking module; second, the embedded representation output by the memory bank;
f5: the target segmentation decoding module expresses all class feature representatives derived from the memory bank in real-valued tensor form;
f6: it further captures the global context of preceding and following frame sequences between images, enriching the representability of semantic understanding;
f7: in the inference stage, the memory bank retains a compressed, representative global feature representation and outputs the video target tracking result through feature information fusion.
8. The video target tracking method for single-sample learning according to claim 1, characterized in that: combining the pre-trained vectorized representations in the steps S1-S6, model parameters specific to the target object are designed during inference to learn and generalize the appearance information of the target object, realizing state-of-the-art dynamic target tracking.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211108906.6A CN115393400A (en) | 2022-09-13 | 2022-09-13 | Video target tracking method for single sample learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115393400A true CN115393400A (en) | 2022-11-25 |
Family
ID=84125715
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211108906.6A Pending CN115393400A (en) | 2022-09-13 | 2022-09-13 | Video target tracking method for single sample learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115393400A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115965758A (en) * | 2022-12-28 | 2023-04-14 | 无锡东如科技有限公司 | Three-dimensional reconstruction method for image cooperation monocular instance |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhao et al. | Object detection with deep learning: A review | |
Fiaz et al. | Handcrafted and deep trackers: Recent visual object tracking approaches and trends | |
Kim et al. | Multi-object tracking with neural gating using bilinear lstm | |
CN110298404B (en) | Target tracking method based on triple twin Hash network learning | |
Cui et al. | Efficient human motion prediction using temporal convolutional generative adversarial network | |
CN112926396B (en) | Action identification method based on double-current convolution attention | |
CN110781262B (en) | Semantic map construction method based on visual SLAM | |
CN109443382A (en) | Vision SLAM closed loop detection method based on feature extraction Yu dimensionality reduction neural network | |
CN111931602A (en) | Multi-stream segmented network human body action identification method and system based on attention mechanism | |
Bilal et al. | A transfer learning-based efficient spatiotemporal human action recognition framework for long and overlapping action classes | |
CN112884742A (en) | Multi-algorithm fusion-based multi-target real-time detection, identification and tracking method | |
Hakim et al. | Survey: Convolution neural networks in object detection | |
Mocanu et al. | Single object tracking using offline trained deep regression networks | |
Saqib et al. | Intelligent dynamic gesture recognition using CNN empowered by edit distance | |
Ning et al. | Deep Spatial/temporal-level feature engineering for Tennis-based action recognition | |
CN116912804A (en) | Efficient anchor-frame-free 3-D target detection and tracking method and model | |
CN115393400A (en) | Video target tracking method for single sample learning | |
Couprie et al. | Joint future semantic and instance segmentation prediction | |
Silva et al. | Online weighted one-class ensemble for feature selection in background/foreground separation | |
Ben Mahjoub et al. | An efficient end-to-end deep learning architecture for activity classification | |
Qiu et al. | Using stacked sparse auto-encoder and superpixel CRF for long-term visual scene understanding of UGVs | |
CN116596966A (en) | Segmentation and tracking method based on attention and feature fusion | |
CN110378938A (en) | A kind of monotrack method based on residual error Recurrent networks | |
CN115100740A (en) | Human body action recognition and intention understanding method, terminal device and storage medium | |
CN115034459A (en) | Pedestrian trajectory time sequence prediction method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |