CN115393400A - Video target tracking method for single sample learning


Info

Publication number: CN115393400A
Application number: CN202211108906.6A
Authority: CN (China)
Prior art keywords: segmentation, target, tracking, image, module
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 曹东
Applicant / Current assignee: Wuxi Dongru Technology Co., Ltd.

Classifications

    • G06T 7/246 (Image analysis; analysis of motion using feature-based methods, e.g. the tracking of corners or segments)
    • G06N 3/088 (Neural networks; learning methods; non-supervised learning, e.g. competitive learning)
    • G06V 10/26 (Image preprocessing; segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion)
    • G06V 10/764 (Pattern recognition or machine learning; classification, e.g. of video objects)
    • G06V 10/774 (Processing image or video features in feature spaces; generating sets of training patterns; bootstrap methods, e.g. bagging or boosting)
    • G06V 10/80 (Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level)
    • G06V 20/40 (Scenes; scene-specific elements in video content)
    • G06T 2207/10016 (Indexing scheme for image analysis or image enhancement; image acquisition modality: video; image sequence)


Abstract

The invention discloses a video target tracking method based on single-sample learning. The method comprises: constructing a semantic segmentation data set of video image target objects and defining an image semantic segmentation model; constructing an image frame sequence perception feature extraction module, a single-sample learning information extraction module and a segmentation tracking module; and inferring the image segmentation of subsequent frames. The optimal segmentation encoding output by the segmentation tracking module is used as the input of the segmentation decoder, whose final output is a multi-channel target semantic segmentation result. The method combines pre-trained vectorized representations with target-specific model parameter learning designed into the inference process, generalizes the appearance information of the target object better and achieves state-of-the-art dynamic target tracking, thereby enabling optimal autonomous action planning for intelligent robots.

Description

Video target tracking method for single sample learning
Technical Field
The invention relates to the field of intelligent manufacturing and machine vision, in particular to a video target tracking method for single-sample learning.
Background
Video target tracking with single-sample learning is a method for dynamic target tracking. In the intelligent manufacturing scenario of a digital factory, an intelligent robot must track a moving target object while executing a given production task and planning its actions autonomously, so that downstream real-time precise control can be achieved. Target object tracking based on video streams is therefore one of the key technologies for realizing such applications effectively. Video target tracking can be achieved by performing frame-by-frame semantic segmentation of the target object over the video sequence, thereby separating the target foreground from the background and enabling the autonomous action planning of the intelligent robot.
Existing video target tracking methods for single-sample learning have notable drawbacks. Prior work adapts a semantic segmentation network to the video target tracking task through online fine-tuning; however, this easily causes overfitting to the target appearance defined by the first frame and introduces high latency due to excessive computational complexity. Subsequent methods integrate a target-specific appearance model into the segmentation model, improving runtime and enabling end-to-end learning, but the feature-matching techniques based on feature embedding that they rely on to generalize to new target objects are often difficult to realize effectively with deep learning.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects of the prior art, the invention provides a video target tracking method based on single-sample learning. The method combines pre-trained vectorized representations with target-specific model parameter learning designed into the inference process, generalizes the appearance information of the target object better and achieves state-of-the-art dynamic target tracking, thereby enabling optimal autonomous action planning for intelligent robots and effectively solving the problems described in the background section.
(II) technical scheme
To achieve the above purpose, the invention adopts the following technical solution. A video target tracking method based on single-sample learning comprises the following operation steps:
S1: model construction: construct a semantic segmentation data set of video image target objects and define an image semantic segmentation model;
S2: module construction: construct an image frame sequence perception feature extraction module;
S3: sample module construction: construct the single-sample learning information extraction module;
S4: tracking module construction: construct the segmentation tracking module;
S5: image segmentation inference: based on the single-sample learning information extraction module and the segmentation tracking module applied to the first frame image, infer the segmentation of subsequent frame images;
S6: result output: the multi-channel target segmentation unit takes the optimal segmentation encoding output by the segmentation tracking module as input, and the final output of the segmentation decoder is a multi-channel target semantic segmentation result.
As a preferred technical solution of the present application, step S1 specifically includes the following steps:
A1: construct a semi-supervised image segmentation data set;
A2: in the image segmentation data set of the video target tracking sequence, the target object is defined only by the reference foreground/background segmentation labels given in the first frame; for a specific video sequence, typically only the semantic segmentation label of the first frame image is given, and the target must then be inferred by segmentation in each subsequent frame;
A3: based on the above data set characteristics, define the video object segmentation framework as a parameterized model whose learnable parameters are obtained through learning during model training;
A4: realize video target tracking as an end-to-end network by means of video image segmentation.
As a preferred technical solution of the present application, step S2 specifically includes the following steps:
B1: the image frame sequence perception feature extraction module comprises a pixel-associated information aggregation unit, a pixel classifier and an attention mechanism unit;
B2: construct the pixel classifier, whose input is the single-sample true-value label Yseg_1; by encoding the input target true-value label Yseg_1, it predicts the segmentation true-value labels of the other input images of the single-sample learning information extraction module;
B3: by applying inference within the video sequence, new image labels for subsequent frames are annotated automatically, yielding feature/segmentation-label pairs (x_t, Yseg_t) for additional frames and thereby expanding the small-sample learning data set.
As a preferred technical solution of the present application, step S3 specifically includes the following steps:
C1: solve by applying steepest-descent iterations, striking a compromise between accuracy and efficiency;
C2: perform all computations with standard neural network operations; starting from a given initialization, execute N steepest-descent iterations. Because steepest descent converges quickly and efficiently, setting the number of iterations to N = 5 during training and inference achieves the expected convergence;
C3: for subsequent update optimizations, setting the number of iterations to N = 2 already achieves a very satisfactory update while keeping the computational cost minimal, so real-time processing is possible;
C4: the constructed single-sample learning information extraction module can be applied to the subsequently input test frame sequence; combined with the real-time parameter updates of the previous step, it is applied downstream to predict the parameters of the segmentation tracking module, and the resulting segmentation encoding of the video tracking target is finally provided as input to the segmentation decoder.
As a preferred technical solution of the present application, step S4 specifically includes the following steps:
D1: the depth feature map of the segmentation tracking module supports real-time, lightweight operation and predicts rich encoding information of the target segmentation;
D2: the model parameters of the segmentation tracking module are obtained by the single-sample learning information extraction module, by minimizing the squared error between the output of the segmentation tracking module and the generated true-value label, weighted by the per-pixel importance weights provided by the attention mechanism unit;
D3: the segmentation tracking module is realized as a convolutional filter with kernel size K = 5; its first intermediate hidden layer adopts dilated (hole) convolution with dilation rate D_r = 2, so that a wider range of inter-pixel mutual information can be extracted at minimal computational cost.
As a preferred technical solution of the present application, step S5 specifically includes the following operation steps:
E1: based on the first frame image of the video sequence, the segmentation tracking module parameters are computed for the initial input image Ima_1 together with its given true-value label Yseg_1;
E2: the image annotation pair (x_1, Yseg_1) of the video sequence constitutes the training sample for learning to segment the given target;
E3: predicting the segmentation tracking module parameters as in the previous step, by directly minimizing the segmentation error in the first frame, ensures robust segmentation prediction for the upcoming frames;
E4: use the first-frame true-value label Yseg_1 as the label inside our single-sample learning information extraction module;
E5: encode the true-value label Yseg_1 so that the segmentation tracking module can predict a richer representation of the target segmentation in the test frames;
E6: this guides the single-sample learning information extraction module to optimize and converge as quickly as possible, and realizes real-time optimal segmentation encoding output from the segmentation tracking module.
As a preferred technical solution of the present application, step S6 specifically includes the following operation steps:
F1: the input of the multi-channel target segmentation unit includes three information streams: first, the real-time optimal segmentation encoding output by the segmentation tracking module; second, the output of the single-sample learning information extraction module; third, the output of the image frame sequence perception feature extraction module;
F2: first construct a memory bank whose input information comes from two branches: the output of the single-sample learning information extraction module, and the output of the image frame sequence perception feature extraction module;
F3: the memory bank stores semantic information through dynamically parameterized regional semantic comparison and pixel semantic aggregation, and is composed of one feature entry per target;
F4: the input information stream of the target segmentation decoding module comprises two branches: the real-time optimal segmentation encoding output by the segmentation tracking module, and the embedded representation output by the memory bank;
F5: the target segmentation decoding module expresses all class feature representations derived from the memory bank in real tensor form;
F6: furthermore, the global context of preceding and following frame sequences between images is captured, enriching the expressive power of semantic understanding;
F7: the memory bank retains a compressed, globally representative feature representation during the inference stage and outputs the video target tracking result through feature information fusion.
As a preferred technical solution of the present application, steps S1 to S6 combine pre-trained vectorized representations with target-specific model parameter learning designed into the inference process, so as to generalize the appearance information of the target object and achieve state-of-the-art dynamic target tracking.
(III) advantageous effects
Compared with the prior art, the invention provides a video target tracking method based on single-sample learning with the following beneficial effects. The method combines pre-trained vectorized representations with target-specific model parameter learning designed into the inference process, so it generalizes the appearance information of the target object better and achieves state-of-the-art dynamic target tracking, which in turn enables optimal autonomous action planning for intelligent robots. Video target tracking refers to locating a moving target object throughout a video sequence, covering both single-target and multi-target tracking; the associated difficulties include occlusion and motion blur, viewpoint and scale changes, background and illumination changes, and the target object moving out of the field of view during tracking. Small-sample learning is a machine learning approach that performs classification or regression with very little labeled data combined with a large amount of unlabeled data; mainstream deep learning methods, by contrast, can only train models to an acceptable generalization performance with the help of large labeled data sets. Small-sample learning therefore effectively addresses the lack of labeled data sets in real production scenarios, markedly reduces the labor and time cost of data annotation, and markedly improves the robustness and generalization of the algorithmic system. Compared with other existing single-sample learning methods of the same kind, the target information extracted in the training and learning stage is richer here: key association information between preceding and following frames of the video sequence, and contextual semantic information between distant and nearby pixels of an image, can be memorized effectively, which improves the accuracy of multi-target construction. Furthermore, the computation time of this single-sample learning method in the model inference stage is much lower than that of other existing models, improving the real-time performance of algorithmic inference.
1. The method successfully realizes single-sample learning video target tracking, with performance superior to other existing methods of the same kind. The segmentation tracking module learns, from the first frame, an initial segmentation true-value label for predicting the target object. Because the true-value label is refined by the network, the network constructed by the invention has a strong capability of learning segmentation priors, and it is not limited to operating on approximate target annotation labels when performing conditional segmentation of the target object. Yseg_1 serves as the label inside the single-sample learning information extraction module, and a trainable pixel classifier is introduced and realized through automatic machine learning inside that module, instead of directly and simply using the first-frame true-value annotation. The segmentation tracking module can therefore predict multi-channel segmentation, providing strongly associated target perception information and making the segmentation prediction more accurate.
2. STM realizes video object segmentation with a space-time memory network, providing a novel solution for semi-supervised video object segmentation that turns video frames with object masks into richer intermediate predictions according to the nature of the problem. FEELVOS proposes fast end-to-end embedding learning for video object segmentation. PreMVOS first generates an accurate set of object segmentation mask proposals for each video frame, then selects and merges these proposals into accurate and temporally consistent pixel-level object tracks across the video sequence, specifically addressing the segmentation of multiple objects in a video sequence. FRTM proposes a VOS architecture consisting of two network components; its target appearance model is a lightweight module that learns in the inference phase with fast optimization techniques and predicts a coarse but robust object segmentation. SiamRCNN, combined with a dynamic-programming-based algorithm, models the complete history of the object to be tracked and of potentially interfering objects using re-detection of the first-frame template and the previous-frame prediction. AGSSVOS segments all object instances simultaneously in one feed-forward path through instance-agnostic and instance-specific modules, with the information of both modules passed through an attention-guided fusion decoder. Compared with these methods, the present method achieves the best performance: relative to the best existing method, the average region similarity (Jaccard, J) score increases by 2.9 and the boundary (F) score by 2.5, and the overall score improves by 3.4% over the best previous method STM. The test results demonstrate that the new method performs best among video target tracking algorithms.
3. In the algorithmic inference process, given a test sequence together with the first-frame label, an initial training set containing the single sample pair is first created for the single-sample learning information extraction module from the feature map extracted from the first frame. The single-sample learning information extraction module then predicts the parameters of the segmentation tracking module through iterative learning; the initial estimate of the segmentation tracking module is set to all zeros to reduce computational complexity and improve the real-time performance of the system. The learned model is then applied to the subsequent test frame Ima_2 to obtain its label encoding. To adapt to the scene, the segmentation tracking module is further updated with information from processed frames based on a memory-bank approach; the memory is kept current by deleting stale samples while the first frame is always retained. If a video sequence contains multiple targets, each target is processed independently in parallel; the complete tracking history of objects and of potential distractors is modeled using re-detection of the first-frame template and the previous-frame prediction, which realizes an optimal tracking decision and improves re-detection of the target object after long occlusions. Experiments verify that the performance of the algorithm is optimal.
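For illustration, the inference flow just described can be sketched in Python as follows. The helper objects (extract_features, learner, tracker, decoder, memory) are hypothetical stand-ins for the modules of Fig. 1, and the schedule only mirrors the description above (more optimization steps on the first frame, fewer on later updates, stale samples dropped, first frame always kept); it is a sketch, not the patent's implementation.

```python
def track_video(frames, first_mask, extract_features, learner, tracker, decoder, memory,
                n_init_steps=5, n_update_steps=2, max_memory=20):
    """Sketch of the described inference flow: learn target-specific tracker weights
    from the annotated first frame, segment each subsequent frame, and refresh the
    single-sample learner with newly processed frames."""
    feats0 = extract_features(frames[0])                 # frame-sequence perception features
    samples = [(feats0, first_mask)]                     # the single annotated sample
    weights = learner.fit(samples, steps=n_init_steps)   # predict segmentation tracking parameters

    masks = [first_mask]
    for frame in frames[1:]:
        feats = extract_features(frame)
        encoding = tracker(feats, weights)               # real-time optimal segmentation encoding
        mask = decoder(encoding, memory.read())          # multi-channel segmentation result
        masks.append(mask)

        memory.write(feats, mask)                        # memory-bank update with frame information
        samples.append((feats, mask))
        if len(samples) > max_memory:                    # delete stale samples, always keep frame 1
            samples = [samples[0]] + samples[-(max_memory - 1):]
        weights = learner.fit(samples, steps=n_update_steps, init=weights)
    return masks
```

When a sequence contains multiple targets, the same loop would be run independently and in parallel for each target, as stated above.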
The whole single-sample learning video target tracking method is simple and convenient to operate, and its effect in use is better than that of traditional approaches.
Drawings
Fig. 1 is a schematic diagram of an overall algorithm flow of a single-sample learning video target tracking method according to the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely below with reference to the accompanying drawing and the detailed embodiments. Those skilled in the art will understand that the embodiments described below are only some, not all, of the embodiments of the present invention; they are used only to illustrate the invention and should not be construed as limiting its scope. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present invention. Examples for which specific conditions are not specified were carried out under conventional conditions or under conditions recommended by the manufacturer. Instruments whose manufacturer is not indicated are conventional, commercially available products.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
As shown in fig. 1, a single-sample learning video target tracking method includes the following operation steps:
S1: model construction: construct a semantic segmentation data set of video image target objects and define an image semantic segmentation model;
S2: module construction: construct an image frame sequence perception feature extraction module;
S3: sample module construction: construct the single-sample learning information extraction module;
S4: tracking module construction: construct the segmentation tracking module;
S5: image segmentation inference: based on the single-sample learning information extraction module and the segmentation tracking module applied to the first frame image, infer the segmentation of subsequent frame images;
S6: result output: the multi-channel target segmentation unit takes the optimal segmentation encoding output by the segmentation tracking module as input, and the final output of the segmentation decoder is a multi-channel target semantic segmentation result.
Further, step S1 specifically includes the following steps:
A1: construct a semi-supervised image segmentation data set;
A2: in the image segmentation data set of the video target tracking sequence, the target object is defined only by the reference foreground/background segmentation labels given in the first frame; typically only the semantic segmentation label of the first frame image of a specific video sequence is given, after which the target must be inferred by segmentation in each subsequent frame;
A3: based on the above data set characteristics, define the video object segmentation framework as a parameterized model whose learnable parameters are obtained through learning during model training;
A4: realize video target tracking as an end-to-end network by means of video image segmentation.
Further, step S2 specifically includes the following operation steps:
B1: the image frame sequence perception feature extraction module comprises a pixel-associated information aggregation unit, a pixel classifier and an attention mechanism unit;
B2: construct the pixel classifier, whose input is the single-sample true-value label Yseg_1; by encoding the input target true-value label Yseg_1, it predicts the segmentation true-value labels of the other input images of the single-sample learning information extraction module;
B3: by applying inference within the video sequence, new image labels for subsequent frames are annotated automatically, yielding feature/segmentation-label pairs (x_t, Yseg_t) for additional frames and thereby expanding the small-sample learning data set.
Further, step S3 specifically includes the following operation steps:
C1: solve by applying steepest-descent iterations, striking a compromise between accuracy and efficiency;
C2: perform all computations with standard neural network operations; starting from a given initialization, execute N steepest-descent iterations. Because steepest descent converges quickly and efficiently, setting the number of iterations to N = 5 during training and inference achieves the expected convergence;
C3: for subsequent update optimizations, setting the number of iterations to N = 2 already achieves a very satisfactory update while keeping the computational cost minimal, so real-time processing is possible;
C4: the constructed single-sample learning information extraction module can be applied to the subsequently input test frame sequence; combined with the real-time parameter updates of the previous step, it is applied downstream to predict the parameters of the segmentation tracking module, and the resulting segmentation encoding of the video tracking target is finally provided as input to the segmentation decoder.
Further, step S4 specifically includes the following operation steps:
D1: the depth feature map of the segmentation tracking module supports real-time, lightweight operation and predicts rich encoding information of the target segmentation;
D2: the model parameters of the segmentation tracking module are obtained by the single-sample learning information extraction module, by minimizing the squared error between the output of the segmentation tracking module and the generated true-value label, weighted by the per-pixel importance weights provided by the attention mechanism unit;
D3: the segmentation tracking module is realized as a convolutional filter with kernel size K = 5; its first intermediate hidden layer adopts dilated (hole) convolution with dilation rate D_r = 2, so that a wider range of inter-pixel mutual information can be extracted at minimal computational cost.
Further, step S5 specifically includes the following steps:
E1: based on the first frame image of the video sequence, the segmentation tracking module parameters are computed for the initial input image Ima_1 together with its given true-value label Yseg_1;
E2: the image annotation pair (x_1, Yseg_1) of the video sequence constitutes the training sample for learning to segment the given target;
E3: predicting the segmentation tracking module parameters as in the previous step, by directly minimizing the segmentation error in the first frame, ensures robust segmentation prediction for the upcoming frames;
E4: use the first-frame true-value label Yseg_1 as the label inside our single-sample learning information extraction module;
E5: encode the true-value label Yseg_1 so that the segmentation tracking module can predict a richer representation of the target segmentation in the test frames;
E6: this guides the single-sample learning information extraction module to optimize and converge as quickly as possible, and realizes real-time optimal segmentation encoding output from the segmentation tracking module.
Further, step S6 specifically includes the following steps:
F1: the input of the multi-channel target segmentation unit includes three information streams: first, the real-time optimal segmentation encoding output by the segmentation tracking module; second, the output of the single-sample learning information extraction module; third, the output of the image frame sequence perception feature extraction module;
F2: first construct a memory bank whose input information comes from two branches: the output of the single-sample learning information extraction module, and the output of the image frame sequence perception feature extraction module;
F3: the memory bank stores semantic information through dynamically parameterized regional semantic comparison and pixel semantic aggregation, and is composed of one feature entry per target;
F4: the input information stream of the target segmentation decoding module comprises two branches: the real-time optimal segmentation encoding output by the segmentation tracking module, and the embedded representation output by the memory bank;
F5: the target segmentation decoding module expresses all class feature representations derived from the memory bank in real tensor form;
F6: furthermore, the global context of preceding and following frame sequences between images is captured, enriching the expressive power of semantic understanding;
F7: the memory bank retains a compressed, globally representative feature representation during the inference stage and outputs the video target tracking result through feature information fusion.
Furthermore, steps S1 to S6 combine pre-trained vectorized representations with target-specific model parameter learning designed into the inference process, so as to generalize the appearance information of the target object and achieve state-of-the-art dynamic target tracking.
Performance was compared with other state-of-the-art methods of the same kind (see the table below). The experimental data set is a video target tracking data set for intelligent manufacturing production scenarios constructed by the inventors, comprising 50 video segments. The evaluation indices include the average region similarity (Jaccard index, J), the boundary score (F) and the overall score. The compared algorithms include STM, FEELVOS, PreMVOS, FRTM, SiamRCNN and AGSSVOS.
[Table of comparative results omitted: it is provided only as an image in the original publication.]
The embodiment is as follows.
The overall method comprises the following implementation steps:
1. Construct a semantic segmentation data set of video image target objects and define an image semantic segmentation model.
2. Construct the image frame sequence perception feature extraction module.
3. Construct the single-sample learning information extraction module.
4. Construct the segmentation tracking module.
5. Based on the single-sample learning information extraction module and the segmentation tracking module applied to the first frame image, infer the segmentation of subsequent frame images.
6. The multi-channel target segmentation unit takes the optimal segmentation encoding output by the segmentation tracking module as input; the final output of the segmentation decoder is a multi-channel target semantic segmentation result.
Overall step 1: construction of the video image target object semantic segmentation data set and definition of the image semantic segmentation model.
Step one: construct a semi-supervised image segmentation data set. A typical video data set consists of a series of image frames in chronological order: a video is a time-ordered sequence of frame images, and a set of annotated segmentation labels corresponds to (a subset of) these frames, distinguishing the foreground target objects from the image background. The video image segmentation data set therefore comprises a set of image-label pairs and, in addition, unsupervised (unlabeled) image frames.
Step two: in the image segmentation data set of the video target tracking sequence, the target object is defined only by the reference foreground/background segmentation labels given in the first frame. This setting belongs to small-sample learning: only the first frame image of a specific video sequence is given a semantic segmentation label, and the target must then be segmented by inference in each subsequent frame. The video sequence is represented in chronological order as a frame sequence whose most recent frame is the current image. In this application scenario only the single first image frame Ima_1 carries true-value annotation data Yseg_1; all other frames are unsupervised data. The data set of the invention therefore finally consists of one annotated first frame together with the unlabeled subsequent frames.
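Purely as an illustration of this data layout, a sequence of the data set described above can be held in a structure like the following; the class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

import numpy as np


@dataclass
class VOSSequence:
    """One sequence of the semi-supervised segmentation data set: chronologically
    ordered frames with a true-value segmentation label only for the first frame;
    every later frame is unsupervised until inference fills in pseudo-labels."""
    frames: List[np.ndarray]              # Ima_1 ... Ima_N, each H x W x 3
    first_frame_label: np.ndarray         # Yseg_1, H x W foreground/background mask
    labels: List[Optional[np.ndarray]] = field(default_factory=list)

    def __post_init__(self):
        if not self.labels:               # only the first frame is annotated
            self.labels = [self.first_frame_label] + [None] * (len(self.frames) - 1)
```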
Step three: based on the above data set characteristics, the video target segmentation framework is defined as a network S with learnable parameters that are obtained through learning during model training. The network takes as input the current image Ima together with the output of the segmentation tracking module. Although the network itself is not specific to the target, it is conditioned on the target: information about the target object is integrated by encoding it into the parameters w of the segmentation tracking module, where w is a convolution kernel forming the weights of a convolutional layer with kernel size K. In this way the goal of integrating the target-object information into, and encoding it by, the parameters w is achieved.
Step four: video target tracking is realized as an end-to-end network by means of video image segmentation. As shown in Fig. 1, it is composed of the image frame sequence perception feature extraction module, the single-sample learning information extraction module, the segmentation tracking module and the multi-channel target segmentation unit. The symbol θ denotes the network parameters learned during offline training, while w denotes the parameters predicted by the single-sample learning information extraction module during inference.
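A skeleton of this composition is sketched below under the assumption that each module is an independent sub-network; the class and argument names are illustrative only. The offline-trained parameters θ live inside the sub-modules, while the target-specific weights w are supplied at inference time by the single-sample learning information extraction module.

```python
import torch.nn as nn


class SingleShotVOSNet(nn.Module):
    """Skeleton of the end-to-end network of Fig. 1: feature extractor,
    single-sample learner (predicts target weights w), segmentation tracking
    module conditioned on w, and multi-channel target segmentation unit."""

    def __init__(self, feature_extractor, few_shot_learner, tracking_head, segmentation_decoder):
        super().__init__()
        self.features = feature_extractor       # offline-learned parameters (theta)
        self.learner = few_shot_learner         # predicts w during inference
        self.tracker = tracking_head            # applies the predicted kernel w
        self.decoder = segmentation_decoder     # outputs the multi-channel segmentation

    def forward(self, frame, target_weights, memory_state=None):
        x = self.features(frame)                        # pixel-associated aggregated features
        encoding = self.tracker(x, target_weights)      # target-conditioned segmentation encoding
        return self.decoder(encoding, x, memory_state)  # multi-channel semantic segmentation
```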
Overall step 2: construction of the image frame sequence perception feature extraction module.
Step one: the image frame sequence perception feature extraction module comprises a pixel-associated information aggregation unit, a pixel classifier and an attention mechanism unit. First, the pixel-associated information aggregation unit is constructed; NASN is adopted as its backbone network, whose structure is obtained with a neural architecture search framework. A time-series image frame Ima is input to the unit, and the resulting pixel-associated aggregated feature x is output in three directions: first, to the memory bank of the multi-channel target segmentation unit; second, to the segmentation tracking module; third, to the single-sample learning information extraction module (see Fig. 1 and the subsequent steps). The multi-channel target segmentation unit uses three residual blocks with spatial stride s = 16. These features first pass through an additional convolutional layer, and before they are input to the subsequent module their channel dimension is reduced to C = 512.
Step two: construct the pixel classifier. Its input is the single-sample true-value label Yseg_1; by encoding the input target true-value label Yseg_1, it predicts the segmentation true-value labels of the other input images of the single-sample learning information extraction module. The pixel classifier maps the label tensor (at image resolution) to a depth feature tensor of size H x W x D, where H, W and D are the height, width and feature dimension of the segmentation tracking module features and s is the feature stride between the image and the features. The pixel classifier is implemented as a dilated (hole) convolutional network.
Step three: the attention mechanism unit takes the target true-value label Yseg as input; the constructed unit maps it to a tensor whose element values are generated according to a convergence function. The training set here is a set of Q feature/segmentation-label pairs (x_t, Yseg_t); it contains the single frame with a true-value segmentation label, (x_1, Yseg_1), and, by applying inference within the video sequence, new image labels for subsequent frames are annotated automatically, yielding pairs (x_t, Yseg_t) for additional frames and thereby expanding the small-sample learning data set. The scalar λ is a regularization parameter obtained by learning. The attention mechanism unit is implemented as a convolutional network with 6 hidden layers; like the pixel classifier, it shares the feature-mapping tensor construction unit.
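The two label-driven units can be sketched as small convolutional networks as below. Only the facts stated above are taken from the text (the label encoder is a dilated convolutional network producing features at the tracking-module resolution; the attention unit has 6 hidden layers and outputs per-pixel importance weights); all channel widths and the exact stride arrangement are assumptions of this sketch.

```python
import torch.nn as nn


class LabelEncoder(nn.Module):
    """Sketch of the pixel classifier: encodes the true-value label Yseg_1 into a
    D-channel representation at the tracking-module feature resolution, using a
    small dilated (hole) convolutional network."""
    def __init__(self, out_channels=32, stride=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=stride // 4, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=4, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(32, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, label_mask):           # label_mask: B x 1 x H x W
        return self.net(label_mask)          # B x D x H/s x W/s


class ImportanceWeightNet(nn.Module):
    """Sketch of the attention mechanism unit: maps the label to non-negative
    per-pixel importance weights used to weight the squared segmentation error
    (higher weight on the target region, lower weight on ambiguous regions)."""
    def __init__(self, hidden=16, stride=16):
        super().__init__()
        layers = [nn.Conv2d(1, hidden, 3, stride=stride // 4, padding=1), nn.ReLU()]
        for _ in range(5):                    # 6 hidden convolutional layers in total
            layers += [nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(hidden, 1, 3, stride=4, padding=1), nn.Softplus()]
        self.net = nn.Sequential(*layers)

    def forward(self, label_mask):
        return self.net(label_mask)           # B x 1 x H/s x W/s
```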
Overall step 3: construction of the single-sample learning information extraction module.
Step one: the constructed single-sample learning information extraction module is continuously differentiable: it minimizes a continuously differentiable objective L(w) over the parameters w of the segmentation tracking module, namely the per-pixel weighted squared error between the segmentation tracking module output and the encoded segmentation labels together with the λ-weighted regularization term,

L(w) = Σ_t || v_t · (x_t * w - E(Yseg_t)) ||^2 + λ ||w||^2,

and the minimization is solved by applying steepest-descent iterations so as to strike a compromise between accuracy and efficiency. The optimization iteration is expressed as

w^(i+1) = w^(i) - α^(i) ∇L(w^(i)),

where ∇ denotes the gradient operation, T denotes transpose, * denotes the convolution operation, t indexes the t-th frame, x_t is the pixel-associated aggregation feature of the t-th frame image, Yseg_t is the segmentation label of the t-th frame, E(·) is the label encoding produced by the pixel classifier, v_t are the per-pixel importance weights produced by the attention mechanism unit, α^(i) is the steepest-descent step length, w^(i) is the value obtained at the i-th iteration for the t-th frame image, and w^(i+1) is the value computed at the (i+1)-th iteration. The input of the single-sample learning information extraction module is this set of feature/label pairs; its output w forms the weights of a convolutional layer with kernel size K and integrates the information about the target object. w is the parameter predicted by the single-sample learning information extraction module during inference.
Step two: all computations of this optimization iteration are realized with standard neural network operations. Because everything is continuously differentiable, the segmentation tracking module parameters w^(i) obtained after i iterations are differentiable with respect to all network parameters θ, and the single-sample learning information extraction module is realized as a network module. As mentioned above, its training set is a set of Q feature/segmentation-label pairs (x_t, Yseg_t); starting from a given initialization w^(0), N steepest-descent iterations are executed (as described in step one). Since steepest descent converges quickly and efficiently, setting the number of iterations to N = 5 during training and inference achieves the expected convergence.
Step three: on the basis of the parameters w obtained in the previous step, the optimization-based formulation above is further used to update the segmentation tracking module parameters in a timely manner as new subsequent frame samples arrive: each newly input frame image is added to the data set and the iterative optimization is applied again. Setting the number of these update iterations to N = 2 already achieves a very satisfactory update while keeping the computational cost minimal, so real-time processing is possible.
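The few-step optimization just described can be sketched as follows, under the assumption that the objective is the attention-weighted squared error between the tracking-module output and the encoded label plus the λ-weighted regularizer. Plain gradient steps with a fixed step size stand in for the steepest-descent step-length rule, which appears only as an image in the original; encode_label and importance are hypothetical stand-ins for the pixel classifier and the attention mechanism unit.

```python
import torch
import torch.nn.functional as F


def optimize_target_weights(samples, init_weights, encode_label, importance,
                            steps=5, step_size=0.1, reg_lambda=0.01, kernel_size=5):
    """samples: list of (x_t, Yseg_t); init_weights: conv kernel w of shape D x C x K x K.
    Runs a few iterations of w <- w - alpha * grad L(w)."""
    w = init_weights.clone().requires_grad_(True)
    for _ in range(steps):
        loss = reg_lambda * (w ** 2).sum()
        for x_t, y_t in samples:
            pred = F.conv2d(x_t, w, padding=kernel_size // 2)  # tracking-module output x_t * w
            target = encode_label(y_t)                         # label encoding E(Yseg_t)
            v_t = importance(y_t)                              # per-pixel importance weights
            loss = loss + (v_t * (pred - target) ** 2).sum()
        (grad,) = torch.autograd.grad(loss, w)
        w = (w - step_size * grad).detach().requires_grad_(True)
    return w.detach()
```

In the flow above, such a routine would be called with steps=5 when processing the first frame and with steps=2 (initialized from the current weights) whenever new frames are added to the sample set.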
Step four: the segmentation tracking module parameters are first obtained from the first frame by steepest-descent iterative prediction. The single-sample learning information extraction module constructed in this way can then be applied to the subsequently input test frame sequence and, combined with the real-time parameter updates of the previous step, is applied downstream to predict the parameters of the segmentation tracking module; the resulting segmentation encoding of the video tracking target is finally provided as input to the segmentation decoder.
Overall step 4: construction of the segmentation tracking module.
Step one: the constructed segmentation tracking module is realized as a mapping whose parameters are the output w of the single-sample learning information extraction module. Through training, the module maps the input C-dimensional depth features x to a D-dimensional encoding of the target segmentation at the same spatial resolution H x W. To ensure that the mapping is continuously differentiable, the segmentation tracking module is constructed as a convolutional mapping whose kernels, of size K, are given by w. The depth feature map of the segmentation tracking module supports real-time, lightweight operation and predicts rich encoding information of the target segmentation.
Step two: the model parameters of the segmentation tracking module are obtained by the single-sample learning information extraction module, by minimizing the squared error between the output of the segmentation tracking module and the generated true-value label, weighted by the per-pixel importance weights provided by the attention mechanism unit.
Step three: the segmentation tracking module is realized as a convolutional filter with kernel size K = 5; its first intermediate hidden layer adopts dilated (hole) convolution with dilation rate D_r = 2, so that a wider range of inter-pixel information can be extracted at minimal computational cost. The number of output channels D is set to 32.
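A minimal functional sketch of such a head is given below; the hidden channel width and the example tensor shapes are assumptions of the sketch, and in the method described here the kernels w_hidden and w_out would be produced by the single-sample learning information extraction module rather than trained offline.

```python
import torch
import torch.nn.functional as F


def segmentation_tracking_head(x, w_hidden, w_out, kernel_size=5, dilation=2):
    """Two-layer convolutional filter with kernel size K = 5 whose first hidden
    layer uses dilated (hole) convolution with dilation rate 2, mapping the
    C-dimensional features x to a D = 32 channel segmentation encoding at the
    same spatial resolution."""
    pad_hidden = dilation * (kernel_size // 2)                 # keep the spatial size
    hidden = F.relu(F.conv2d(x, w_hidden, padding=pad_hidden, dilation=dilation))
    return F.conv2d(hidden, w_out, padding=kernel_size // 2)


# Example shapes (illustrative): C = 512 input channels, 64 hidden channels, D = 32 output channels.
x = torch.randn(1, 512, 30, 52)
w_hidden = torch.randn(64, 512, 5, 5)
w_out = torch.randn(32, 64, 5, 5)
encoding = segmentation_tracking_head(x, w_hidden, w_out)      # 1 x 32 x 30 x 52
```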
Overall step 5: inference of the segmentation of subsequent frame images, based on the single-sample learning information extraction module and the segmentation tracking module applied to the first frame image.
Step one: based on the first frame image of the video sequence, the segmentation tracking module parameters are computed for the initial input image Ima_1 together with its given true-value label Yseg_1, where the true-value label Yseg_1 defines the feature information of the semantic segmentation of the target object. In other words, the parameters are obtained through the computation of the single-sample learning information extraction module.
Step two: the image annotation pair (x_1, Yseg_1) of the video sequence constitutes the training sample for learning to segment the given target. However, this training sample only becomes available at the inference stage, so the task belongs to small-sample learning in video object segmentation; since only a single first-frame true-value annotation is actually given, it is, strictly speaking, a single-sample learning problem. The single-sample learning information extraction module generates, from the single true-value annotation pair (x_1, Yseg_1), the parameters w required by the segmentation tracking module; this is realized by minimizing the supervised learning objective L(w) defined in overall step 3.
Step three: the pixel-associated information aggregation unit performs the deep embedded feature representation operation on the input image, where the backbone network employs a DenseNet architecture. The segmentation tracking module learns to output the segmentation of the target object in the initial frame. Given a new frame Ima during inference, the target is segmented by applying the segmentation tracking module to the new frame to generate the first new segmentation inference. Because consecutive video frames are strongly correlated, predicting the segmentation tracking module parameters as in the previous step, by directly minimizing the segmentation error in the first frame, ensures a robust segmentation prediction for the upcoming frame.
Step four: the first-frame true-value label Yseg_1 is used as the label inside our single-sample learning information extraction module. Combined with the subsequent unsupervised image data, a trainable label generator is introduced: it takes the true-value segmentation label Yseg as input and predicts the labels of the data set (i.e. the image sequence) used by the single-sample learning information extraction module for learning and training, so that the labels for the module are inferred predictions based on the first-frame annotation as input. In this way the segmentation tracking module parameters are predicted.
Step five: code generation is performed on the ground-truth annotation label Yseg_1, allowing the segmentation tracking module to predict a richer representation of the target segmentation in the test frames.
Step six: to address the imbalance of the training data set, a higher weight is assigned to the target region and a lower weight to blurred regions of the field of view. The attention mechanism unit adjusts the loss value: it takes the ground-truth label Yseg as input and predicts the importance weight of each pixel in the image for the loss. This guides the single-sample learning information extraction module to optimize and converge as fast as possible, and realizes real-time optimal segmentation code output from the segmentation tracking module.
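A minimal sketch of how such an attention weighting unit could look, assuming it is a small convolutional network that maps the label Yseg to positive per-pixel loss weights (higher on the target region, lower elsewhere); the architecture is illustrative, not taken from the patent.

```python
# Assumed attention unit: predicts per-pixel importance weights from the label.
import torch
import torch.nn as nn

class AttentionWeightUnit(nn.Module):
    def __init__(self, label_channels=1, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(label_channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 3, padding=1), nn.Softplus(),  # strictly positive weights
        )

    def forward(self, yseg):
        """yseg: (B, 1, H, W) ground-truth label -> (B, 1, H, W) importance weights."""
        return self.net(yseg)
```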
General step 6: the multi-channel target segmentation unit takes the optimal segmentation code output by the segmentation tracking module as input, and its final output is the multi-channel target semantic segmentation result.
Step one: the input of the multi-channel target segmentation unit comprises three information streams: first, the real-time optimal segmentation code output by the segmentation tracking module; second, the output of the single-sample learning information extraction module; and third, the output of the image frame sequence perception feature extraction module. The multi-channel target segmentation unit consists of two parts, a target segmentation decoding module and a memory bank; the memory bank construction is introduced in steps two and three, and the target segmentation decoding module is introduced in steps four to seven.
Step two: first the memory bank is constructed. Its input comes from two branches: one is the output of the single-sample learning information extraction module, the other is the output of the image frame sequence perception feature extraction module; these information streams contain the pixel-level shallow and deep semantic features of the input image. The role of the memory bank is as follows: the contextual semantics between target object components within an image and between preceding and following frame sequences are essential for understanding the relevance between pixels. The memory bank selectively memorizes the feature information of the input image sequence to realize semantic aggregation of pixels, uses the context information of the time-series image frames it stores to enhance semantic understanding for segmentation, and provides a large number of regional semantic segmentation references.
Step three: the memory bank stores semantic information with dynamic parameters configured for regional semantic comparison and pixel semantic aggregation. The memory bank consists of a set of target feature components, one per category; each entry is a global region-aware representation of one class accumulated over all images Ima observed during the learning phase, stored as a real-valued vector of fixed dimension. At each training stage, the memory bank is updated with the newly learned features: the current depth features of the image Ima are smoothly blended into the stored memory representation, controlled by a memory bank hyper-parameter ε with 0 < ε < 0.06, and an entry is updated only when its category appears in the image Ima with classification confidence greater than a threshold.
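The smooth update can be sketched as follows. The exponential-moving-average form is an assumption; the hyper-parameter range 0 < ε < 0.06 and the confidence gate follow the text, while the default threshold and shapes are illustrative.

```python
# Hedged sketch of the memory bank with confidence-gated smooth updates.
import torch

class MemoryBank:
    def __init__(self, num_classes, feat_dim, eps=0.05, conf_threshold=0.5):
        assert 0.0 < eps < 0.06                        # hyper-parameter range from the text
        self.eps = eps
        self.conf_threshold = conf_threshold
        self.mem = torch.zeros(num_classes, feat_dim)  # one global representation per category

    def update(self, class_feats, class_conf):
        """class_feats: (num_classes, feat_dim) depth features of the current image Ima.
        class_conf: (num_classes,) classification confidence per category."""
        gate = class_conf > self.conf_threshold        # only confidently present classes update
        blended = (1.0 - self.eps) * self.mem + self.eps * class_feats
        self.mem = torch.where(gate.unsqueeze(1), blended, self.mem)
```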
Step four: next, the structure of the target segmentation decoding module is introduced (steps four to seven). Its input information stream comprises two branches: one is the real-time optimal segmentation code output by the segmentation tracking module; the other is the embedded representation output by the memory bank. The target segmentation decoding module compresses the embedded representations from the memory bank into a compact set of representative sample embeddings. For each category, all of its features are passed through a 7-layer convolutional network for multi-way classification, yielding a matrix of fixed-dimension feature representation vectors; multiple feature representations are kept for each class to account for feature variation within the class.
Step five: the target segmentation decoding module collects all class feature representatives derived from the memory bank and expresses them as a real-valued tensor. For each feature map, where W and H denote the width and height of the image, its depth semantics are computed from the embedded representation: for fast computation, the feature tensor and the representative tensor are flattened into matrix form, a similarity matrix is obtained by multiplying one matrix with the transpose of the other, and a hyperbolic tangent is applied; each entry of this matrix reflects a canonical generalized distance measure between one row (a pixel feature) of the flattened feature matrix and one column (a class feature representative) of the representative matrix. Based on these depth semantics and the feature embedding, a feature representation is computed and its tensor structure is adjusted back to the original spatial form.
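The memory read of step five can be sketched as follows, under assumed shapes: feat is the (C, H, W) embedded feature tensor of the current frame and mem is the (K, C) matrix of class feature representatives. The tanh of the matrix product follows the description; the exact normalization and dimensions are assumptions.

```python
# Assumed memory-read operation: similarity to class representatives, then
# a memory-weighted feature representation reshaped back to tensor form.
import torch

def memory_read(feat, mem):
    C, H, W = feat.shape
    flat = feat.reshape(C, H * W).t()        # (H*W, C): one row per pixel feature
    sim = torch.tanh(flat @ mem.t())         # (H*W, K): generalized distance to each class
    read = sim @ mem                         # (H*W, C): memory-weighted representation
    return read.t().reshape(C, H, W)         # adjust tensor structure back to (C, H, W)

# Example usage with illustrative sizes:
feat = torch.randn(64, 32, 32)
mem = torch.randn(10, 64)
out = memory_read(feat, mem)                 # (64, 32, 32)
```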
Step six: the memory-derived feature representation is then concatenated with the original features. The combined representation not only encodes the local context semantics within an image, but also captures the global context of preceding and following frame sequences across images, which enriches the representational power of the semantic understanding.
Step seven: for the multi-channel target pixel classification output by the target segmentation decoding module, the backbone network adopts DenseNet to map the input image Ima into a convolutional representation; classification is realized with a convolutional neural network of hidden-layer depth 9 that generates a class-aware attention map from the feature embedding, using 5x5 convolution kernels. The memory bank stores all region patterns seen in the training data; at the inference stage it retains the compressed global feature representatives, and the video target tracking result is output through feature information fusion.
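A hedged sketch of the decoder head of steps six and seven: the memory-read features are concatenated with the original features and the segmentation code, then classified by a convolutional network with nine 5x5 hidden layers. Channel sizes and the exact fusion are assumptions, not taken from the patent.

```python
# Assumed multi-channel segmentation decoder: feature fusion followed by a
# nine-hidden-layer convolutional classifier with 5x5 kernels.
import torch
import torch.nn as nn

class SegmentationDecoder(nn.Module):
    def __init__(self, feat_ch=64, code_ch=16, num_classes=10, hidden=64, depth=9):
        super().__init__()
        layers, in_ch = [], 2 * feat_ch + code_ch        # original + memory-read + seg code
        for _ in range(depth):                           # nine hidden layers, 5x5 kernels
            layers += [nn.Conv2d(in_ch, hidden, 5, padding=2), nn.ReLU()]
            in_ch = hidden
        layers += [nn.Conv2d(in_ch, num_classes, 1)]     # multi-channel per-pixel classes
        self.net = nn.Sequential(*layers)

    def forward(self, feat, mem_read, seg_code):
        fused = torch.cat([feat, mem_read, seg_code], dim=1)  # feature information fusion
        return self.net(fused)
```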
It is noted that, herein, relational terms such as first and second and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing shows and describes the general principles and features of the present invention, together with its advantages. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described and illustrated in the specification only to explain the principles of the invention, and that various changes and modifications may be made without departing from the spirit and scope of the present invention; such changes and modifications fall within the scope of the invention as claimed.

Claims (8)

1. A video target tracking method for single-sample learning, characterized in that the method comprises the following operation steps:
S1: model construction: constructing a semantic segmentation data set of the video image target object and defining an image semantic segmentation model;
S2: module construction: constructing an image frame sequence perception feature extraction module;
S3: sample module construction: constructing the single-sample learning information extraction module;
S4: tracking module construction: constructing the segmentation tracking module;
S5: image segmentation inference: based on the single-sample learning information extraction module of the first frame image and the segmentation tracking module, inferring the segmentation of the subsequent frame images;
S6: result output: the multi-channel target segmentation unit takes the optimal segmentation code output by the segmentation tracking module as input, and the final output of the segmentation decoder is the multi-channel target semantic segmentation result.
2. The video target tracking method for single-sample learning according to claim 1, characterized in that step S1 specifically comprises the following operation steps:
A1: constructing a semi-supervised image segmentation data set;
A2: in the image segmentation data set of the video target tracking time sequence, the target object is defined only by the reference foreground and background segmentation labels given in the first frame; for a specific video sequence, only the semantic segmentation label of the first frame image is typically given, and the target in each subsequent frame must then be segmented by inference;
A3: based on the above data set characteristics, defining a video object segmentation framework whose learnable parameters are obtained by learning during model training;
A4: realizing video target tracking as an end-to-end network by adopting a video image segmentation method.
3. The video target tracking method for single-sample learning according to claim 1, characterized in that step S2 specifically comprises the following operation steps:
B1: the image frame sequence perception feature extraction module comprises a pixel correlation information convergence unit, a pixel classifier, and an attention mechanism unit;
B2: constructing the pixel classifier, whose input is the single-sample ground-truth annotation label Yseg_1; by encoding the input target ground-truth label Yseg_1, it predicts the segmentation ground-truth labels of the other input images of the single-sample learning information extraction module;
B3: adopting an inference method over the video sequence to automatically label the new image of the next frame, thereby obtaining feature-segmentation label pairs (x_t, Yseg_t) for additional frames and expanding the small-sample learning data set.
4. The video target tracking method for single-sample learning according to claim 1, characterized in that step S3 specifically comprises the following operation steps:
C1: solving by applying steepest-descent iterations, so as to obtain a compromise between precision and efficiency;
C2: performing all computations with standard neural network operations, executing N steepest-descent iterations from a given initialization, where the convergence of steepest descent is fast and efficient and the expected convergence is achieved by setting the iteration number N = 5 during training and inference;
C3: setting N = 2 for the new optimization iterations achieves a highly satisfactory update while reducing the computation to a minimum, so that real-time processing can be realized;
C4: the constructed single-sample learning information extraction module can be applied to the subsequently input time-ordered test frame sequence; combined with the real-time value updates of the previous step, it is applied to the downstream processing stage to predict the segmentation tracking module, and the segmentation code used to obtain the video tracking target is finally provided as input to the segmentation decoder.
5. The video target tracking method for single-sample learning according to claim 1, characterized in that step S4 specifically comprises the following operation steps:
D1: the depth feature map of the segmentation tracking module supports real-time lightweight operation and can predict rich coding information for the target segmentation;
D2: the model parameters of the segmentation tracking module are obtained by the single-sample learning information extraction module through minimizing the squared error between the output of the segmentation tracking module and the generated ground-truth label, weighted by the per-pixel importance weights provided by the attention mechanism unit;
D3: the segmentation tracking module is realized as a convolution filter with kernel size K = 5, in which the first intermediate hidden layer adopts dilated convolution with dilation rate D_r = 2, so that a wider range of pixel mutual information can be extracted with minimal computation.
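For illustration, D3 above can be sketched as the following PyTorch module, with kernel size K = 5 and dilation rate 2 as recited; channel counts and the activation are assumptions.

```python
# Assumed realization of the segmentation tracking filter of D3.
import torch.nn as nn

class SegmentationTrackingFilter(nn.Module):
    def __init__(self, in_ch=64, hidden=32, code_ch=16):
        super().__init__()
        self.hidden = nn.Conv2d(in_ch, hidden, kernel_size=5, dilation=2, padding=4)  # dilated 5x5
        self.out = nn.Conv2d(hidden, code_ch, kernel_size=5, padding=2)                # 5x5 output
        self.act = nn.ReLU()

    def forward(self, feat):
        """feat: (B, in_ch, H, W) deep features -> (B, code_ch, H, W) segmentation code."""
        return self.out(self.act(self.hidden(feat)))
```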
6. The video target tracking method for single-sample learning according to claim 1, characterized in that step S5 specifically comprises the following operation steps:
E1: based on the first frame image of the video sequence, computing the segmentation tracking module parameters for the initial input image Ima_1 together with the given ground-truth annotation label Yseg_1;
E2: the annotated image pair (x_1, Yseg_1) of the video sequence constitutes the training sample for learning to segment the given target;
E3: predicting the segmentation tracking module parameters of the previous step by directly minimizing the segmentation error in the first frame ensures robust segmentation prediction for the upcoming frames;
E4: using the first-frame ground-truth annotation Yseg_1 as the label in our single-sample learning information extraction module;
E5: performing code generation on the ground-truth annotation label Yseg_1, allowing the segmentation tracking module to predict a richer representation of the target segmentation in the test frames;
E6: guiding the single-sample learning information extraction module to optimize and converge as fast as possible, realizing real-time optimal segmentation code output from the segmentation tracking module.
7. The video target tracking method for single-sample learning according to claim 1, characterized in that step S6 specifically comprises the following operation steps:
F1: the input of the multi-channel target segmentation unit comprises three information streams: first, the real-time optimal segmentation code output by the segmentation tracking module; second, the output of the single-sample learning information extraction module; third, the output of the image frame sequence perception feature extraction module;
F2: first constructing the memory bank, whose input information comes from two branches: one is the output of the single-sample learning information extraction module, the other is the output of the image frame sequence perception feature extraction module;
F3: the memory bank stores semantic information with dynamic parameters configured for regional semantic comparison and pixel semantic aggregation, and consists of a set of target feature components;
F4: the input information stream of the target segmentation decoding module comprises two branches: one is the real-time optimal segmentation code output by the segmentation tracking module; the other is the embedded representation output by the memory bank;
F5: the target segmentation decoding module expresses all class feature representatives derived from the memory bank in the form of a real-valued tensor;
F6: further capturing the global context of preceding and following frame sequences across images, thereby enriching the representational power of the semantic understanding;
F7: the memory bank retains the compressed global feature representatives at the inference stage, and the video target tracking result is output through feature information fusion.
8. The video target tracking method for single-sample learning according to claim 1, characterized in that: combining the pre-trained vectorized representations from steps S1-S6, model parameters specific to the target object are learned during inference, the appearance information of the target object is generalized, and optimal dynamic target tracking is realized.
CN202211108906.6A 2022-09-13 2022-09-13 Video target tracking method for single sample learning Pending CN115393400A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211108906.6A CN115393400A (en) 2022-09-13 2022-09-13 Video target tracking method for single sample learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211108906.6A CN115393400A (en) 2022-09-13 2022-09-13 Video target tracking method for single sample learning

Publications (1)

Publication Number Publication Date
CN115393400A true CN115393400A (en) 2022-11-25

Family

ID=84125715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211108906.6A Pending CN115393400A (en) 2022-09-13 2022-09-13 Video target tracking method for single sample learning

Country Status (1)

Country Link
CN (1) CN115393400A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115965758A (en) * 2022-12-28 2023-04-14 无锡东如科技有限公司 Three-dimensional reconstruction method for image cooperation monocular instance

Similar Documents

Publication Publication Date Title
Zhao et al. Object detection with deep learning: A review
Fiaz et al. Handcrafted and deep trackers: Recent visual object tracking approaches and trends
Kim et al. Multi-object tracking with neural gating using bilinear lstm
CN110298404B (en) Target tracking method based on triple twin Hash network learning
Cui et al. Efficient human motion prediction using temporal convolutional generative adversarial network
CN112926396B (en) Action identification method based on double-current convolution attention
CN110781262B (en) Semantic map construction method based on visual SLAM
CN109443382A (en) Vision SLAM closed loop detection method based on feature extraction Yu dimensionality reduction neural network
CN111931602A (en) Multi-stream segmented network human body action identification method and system based on attention mechanism
Bilal et al. A transfer learning-based efficient spatiotemporal human action recognition framework for long and overlapping action classes
CN112884742A (en) Multi-algorithm fusion-based multi-target real-time detection, identification and tracking method
Hakim et al. Survey: Convolution neural networks in object detection
Mocanu et al. Single object tracking using offline trained deep regression networks
Saqib et al. Intelligent dynamic gesture recognition using CNN empowered by edit distance
Ning et al. Deep Spatial/temporal-level feature engineering for Tennis-based action recognition
CN116912804A (en) Efficient anchor-frame-free 3-D target detection and tracking method and model
CN115393400A (en) Video target tracking method for single sample learning
Couprie et al. Joint future semantic and instance segmentation prediction
Silva et al. Online weighted one-class ensemble for feature selection in background/foreground separation
Ben Mahjoub et al. An efficient end-to-end deep learning architecture for activity classification
Qiu et al. Using stacked sparse auto-encoder and superpixel CRF for long-term visual scene understanding of UGVs
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN110378938A (en) A kind of monotrack method based on residual error Recurrent networks
CN115100740A (en) Human body action recognition and intention understanding method, terminal device and storage medium
CN115034459A (en) Pedestrian trajectory time sequence prediction method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination