CN112734803A - Single target tracking method, device, equipment and storage medium based on text description - Google Patents

Single target tracking method, device, equipment and storage medium based on text description

Info

Publication number
CN112734803A
CN112734803A CN202011642602.9A CN202011642602A
Authority
CN
China
Prior art keywords
visual
feature
target
character
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011642602.9A
Other languages
Chinese (zh)
Other versions
CN112734803B (en)
Inventor
张伟
吴爽
陈佳铭
宋然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202011642602.9A priority Critical patent/CN112734803B/en
Publication of CN112734803A publication Critical patent/CN112734803A/en
Application granted granted Critical
Publication of CN112734803B publication Critical patent/CN112734803B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence

Abstract

The invention discloses a single-target tracking method, apparatus, device and storage medium based on a text description. The single-target tracking method comprises: evenly dividing a video to be tracked into a plurality of video packets of a set number of frames; extracting first, second and third text features from the text description; extracting first, second and third visual features from the n-th sampled frame of each video packet; updating the first, second and third text features based on the first, second and third visual features of the n-th sampled frame of each video packet to obtain updated first, second and third text features; extracting fourth, fifth and sixth visual features from the template image of the target to be tracked; extracting seventh, eighth and ninth visual features from the search-region image; fusing the updated first, second and third text feature vectors with the fourth to ninth visual features to obtain fused features; and obtaining a target tracking result for each frame in the current video packet of the video to be tracked from the fused features.

Description

Single target tracking method, device, equipment and storage medium based on text description
Technical Field
The present application relates to the field of machine vision and natural language processing technologies, and in particular, to a single-target tracking method, apparatus, device, and storage medium based on text description.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Single-target tracking is a classic and long-studied topic in the field of machine vision. Conventional single-target tracking methods typically require the target to be tracked to be marked manually with a bounding box in one frame of the video. In recent years, topics that combine machine vision with natural language processing, such as image/video captioning and visual question answering, have advanced considerably, and single-target tracking based on a text description has received increasing attention. Given a text annotation, tracking the described target in the video enables the algorithm to handle many complex scenes better, such as occlusion, frame offset, target deformation and blurring, because the semantic information provided by the natural language description helps the target tracking algorithm mitigate the effects of these complex scenarios.
However, single-target tracking based on a text description has a particular problem. A natural language description may describe the appearance and motion state of the target in the first frame, or describe the motion of the target over the whole video, but annotating every frame of a video with text is not feasible. In common single-target tracking datasets with natural language annotations, the text annotation generally describes the overall content of the video, and no dataset annotates every frame. Yet the position and appearance of the target change constantly throughout the video, so the natural language annotation cannot accurately describe the position or motion of the target in most scenarios. While past related work has performed well, it treats the text annotation only as a global constraint.
Disclosure of Invention
To overcome the deficiencies of the prior art, the present application provides a single-target tracking method, apparatus, device and storage medium based on a text description.
In a first aspect, the present application provides a single-target visual tracking method based on a text description.
The single-target visual tracking method based on a text description comprises the following steps:
obtaining a template image of the target to be tracked; acquiring the video to be tracked and a text description related to the target to be tracked; evenly dividing the video to be tracked into a plurality of video packets of a set number of frames;
extracting first, second and third text features from the text description;
extracting first, second and third visual features from the n-th sampled frame of each video packet, where n is a positive integer whose upper limit is a specified value; updating the first, second and third text features based on the first, second and third visual features of the n-th sampled frame of each video packet to obtain updated first, second and third text features; extracting fourth, fifth and sixth visual features from the template image of the target to be tracked, the template image of the target to be tracked being the first frame image of the video to be tracked; extracting seventh, eighth and ninth visual features from the search-region image, the search-region image being all images in the current video packet;
fusing the updated first, second and third text feature vectors with the fourth, fifth, sixth, seventh, eighth and ninth visual features to obtain six fused features;
and obtaining a target tracking result for each frame in the current video packet of the video to be tracked from the six fused features.
In a second aspect, the present application provides a single-target visual tracking apparatus based on a text description.
The single-target visual tracking apparatus based on a text description comprises:
a video packet dividing module configured to: obtain a template image of the target to be tracked; acquire the video to be tracked and a text description related to the target to be tracked; and evenly divide the video to be tracked into a plurality of video packets of a set number of frames;
a text feature extraction module configured to: extract first, second and third text features from the text description;
a visual feature extraction module configured to: extract first, second and third visual features from the n-th sampled frame of each video packet, where n is a positive integer whose upper limit is a specified value; update the first, second and third text features based on the first, second and third visual features of the n-th sampled frame of each video packet to obtain updated first, second and third text features; extract fourth, fifth and sixth visual features from the template image of the target to be tracked, the template image of the target to be tracked being the first frame image of the video to be tracked; and extract seventh, eighth and ninth visual features from the search-region image, the search-region image being all images in the current video packet;
a feature fusion module configured to: fuse the updated first, second and third text feature vectors with the fourth, fifth, sixth, seventh, eighth and ninth visual features to obtain six fused features;
an output module configured to: obtain a target tracking result for each frame in the current video packet of the video to be tracked from the six fused features.
In a third aspect, the present application further provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device is running, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first aspect.
In a fourth aspect, the present application also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
In a fifth aspect, the present application also provides a computer program (product) comprising a computer program for implementing the method of any of the preceding first aspects when run on one or more processors.
Compared with the prior art, the beneficial effects of the present application are as follows:
The deep text feature of the description is updated with the deep visual features of the search region generated during tracking, so that the deep text feature changes as the target changes in the video, which improves the accuracy of the single-target tracking algorithm.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a flow chart of a method of the first embodiment;
FIG. 2 is a flow chart of a method of the first embodiment;
FIGS. 3(a)-3(g) are schematic views of the effects of the first embodiment.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and "comprising", and any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
This embodiment provides a single-target visual tracking method based on a text description.
The single-target visual tracking method based on a text description comprises the following steps:
S101: obtaining a template image of the target to be tracked; acquiring the video to be tracked and a text description related to the target to be tracked; evenly dividing the video to be tracked into a plurality of video packets of a set number of frames;
S102: extracting first, second and third text features from the text description;
S103: extracting first, second and third visual features from the n-th sampled frame of each video packet, where n is a positive integer whose upper limit is a specified value;
updating the first, second and third text features based on the first, second and third visual features of the n-th sampled frame of each video packet to obtain updated first, second and third text features;
extracting fourth, fifth and sixth visual features from the template image of the target to be tracked, the template image of the target to be tracked being the first frame image of the video to be tracked;
extracting seventh, eighth and ninth visual features from the search-region image, the search-region image being all images in the current video packet;
S104: fusing the updated first, second and third text feature vectors with the fourth, fifth, sixth, seventh, eighth and ninth visual features to obtain six fused features;
S105: obtaining a target tracking result for each frame in the current video packet of the video to be tracked from the six fused features.
Illustratively, the video to be tracked is divided evenly into a plurality of video packets of a set number of frames. For example, a 1000-frame video to be tracked is divided into 10 video packets of 100 frames each; likewise, a 100-frame video to be tracked is divided into 10 video packets of 10 frames each. A minimal sketch of this packet-splitting step is given below.
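A minimal sketch of the packet-splitting step, assuming the video is given as a list of frames; the helper name split_into_packets is hypothetical:

```python
def split_into_packets(frames, frames_per_packet):
    """Split a list of video frames into consecutive packets of equal length.

    The last packet keeps any remaining frames if the video length is not an
    exact multiple of frames_per_packet.
    """
    return [frames[i:i + frames_per_packet]
            for i in range(0, len(frames), frames_per_packet)]

# Example: 1000 frames split into 10 packets of 100 frames each.
packets = split_into_packets(list(range(1000)), 100)
assert len(packets) == 10 and len(packets[0]) == 100
```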
As one or more embodiments, S102, extracting first, second and third text features from the text description, comprises the following specific step:
extracting the first, second and third text features from the text description using the BERT method, as sketched below.
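A hedged sketch of the text-encoding step using the Hugging Face transformers library: the description is encoded by BERT into a 768-dimensional vector and projected to 512 dimensions by a fully connected layer, as stated later in the description. The pooling choice, the pretrained checkpoint and the projection layer name text_fc are assumptions, and only one of the three parallel text-feature branches is shown:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()
text_fc = nn.Linear(768, 512)  # assumed projection from the 768-d BERT output to 512-d

def encode_description(description: str) -> torch.Tensor:
    """Encode a natural-language description into a 512-d text feature."""
    tokens = tokenizer(description, return_tensors="pt")
    with torch.no_grad():
        out = bert(**tokens)
    # Mean-pool the token embeddings into a single 768-d sentence vector.
    sentence = out.last_hidden_state.mean(dim=1)  # (1, 768)
    return text_fc(sentence)                      # (1, 512)

text_feat = encode_description("a red car driving in the left lane")
```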
As one or more embodiments, S103, extracting first, second and third visual features from the n-th sampled frame of each video packet, where n is a positive integer whose upper limit is a specified value, comprises the following specific steps:
performing visual feature extraction on the n-th sampled frame of each video packet with ResNet-50, as sketched after this list;
convolutional layer Conv2_3 outputs the first visual feature;
convolutional layer Conv3_4 outputs the second visual feature;
convolutional layer Conv5_3 outputs the third visual feature.
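One possible way to tap three intermediate ResNet-50 stages with torchvision forward hooks; mapping the patent's layer names Conv2_3 / Conv3_4 / Conv5_3 onto torchvision's layer1 / layer2 / layer4 modules is an assumption, not something the patent specifies:

```python
import torch
import torchvision

backbone = torchvision.models.resnet50(weights=None).eval()
features = {}

def save_to(name):
    def hook(module, inputs, output):
        features[name] = output
    return hook

# Assumed correspondence: Conv2_3 -> layer1, Conv3_4 -> layer2, Conv5_3 -> layer4.
backbone.layer1.register_forward_hook(save_to("conv2_3"))
backbone.layer2.register_forward_hook(save_to("conv3_4"))
backbone.layer4.register_forward_hook(save_to("conv5_3"))

frame = torch.randn(1, 3, 255, 255)  # one sampled frame, resized to 255 x 255
with torch.no_grad():
    backbone(frame)

first_visual, second_visual, third_visual = (
    features["conv2_3"], features["conv3_4"], features["conv5_3"])
```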
As one or more embodiments, S103, updating the first, second and third text features based on the first, second and third visual features of the n-th sampled frame of each video packet to obtain updated first, second and third text features, comprises the following specific steps (a sketch follows this list):
the first visual feature is processed by global average pooling (GAP) to obtain a first sub-visual feature; the first text feature is used as the initial hidden state of a first LSTM model; the first sub-visual feature is input into the first LSTM model at a set time t, and the first LSTM model outputs the updated first text feature; in the first LSTM model, a forget gate decides whether the hidden state at the current moment should be discarded, and an input gate decides whether the value of the input visual feature should be written;
the second visual feature is processed by global average pooling to obtain a second sub-visual feature; the second text feature is used as the initial hidden state of a second LSTM model; the second sub-visual feature is input into the second LSTM model at the set time t, and the second LSTM model outputs the updated second text feature;
the third visual feature is processed by global average pooling to obtain a third sub-visual feature; the third text feature is used as the initial hidden state of a third LSTM model; and the third sub-visual feature is input into the third LSTM model at the set time t, and the third LSTM model outputs the updated third text feature.
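A minimal sketch of one update branch under stated assumptions: a standard nn.LSTMCell stands in for the unit (the patent's own gate equations, given later in the description, omit the output gate), the text feature is 512-dimensional, and a 1 x 1 convolution aligns the visual channels; the class name TextFeatureUpdater is hypothetical:

```python
import torch
import torch.nn as nn

class TextFeatureUpdater(nn.Module):
    """Updates a text feature with the pooled visual feature of each sampled frame."""

    def __init__(self, visual_dim=256, text_dim=512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)             # global average pooling
        self.proj = nn.Conv2d(visual_dim, text_dim, 1)  # assumed channel alignment
        self.cell = nn.LSTMCell(input_size=text_dim, hidden_size=text_dim)

    def forward(self, text_feat, visual_feat, cell_state=None):
        # text_feat: (B, text_dim) initial or previously updated hidden state
        # visual_feat: (B, visual_dim, H, W) feature map of the n-th sampled frame
        v = self.pool(self.proj(visual_feat)).flatten(1)   # (B, text_dim)
        if cell_state is None:
            cell_state = torch.zeros_like(text_feat)
        h, c = self.cell(v, (text_feat, cell_state))
        return h, c  # h is the updated text feature

updater = TextFeatureUpdater()
text_feat = torch.randn(1, 512)
visual_feat = torch.randn(1, 256, 31, 31)
text_feat, cell = updater(text_feat, visual_feat)
```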
As one or more embodiments, S103, extracting fourth, fifth and sixth visual features from the template image of the target to be tracked (the template image being the first frame image of the video to be tracked) and extracting seventh, eighth and ninth visual features from the search-region image (the search-region image being all images in the current video packet), comprises the following specific steps:
performing visual feature extraction on the template image of the target to be tracked with ResNet-50;
convolutional layer Conv2_3 of ResNet-50 outputs the fourth visual feature;
convolutional layer Conv3_4 of ResNet-50 outputs the fifth visual feature;
convolutional layer Conv5_3 of ResNet-50 outputs the sixth visual feature;
performing visual feature extraction on the search-region image of the target to be tracked with ResNet-50;
convolutional layer Conv2_3 of ResNet-50 outputs the seventh visual feature;
convolutional layer Conv3_4 of ResNet-50 outputs the eighth visual feature;
convolutional layer Conv5_3 of ResNet-50 outputs the ninth visual feature.
As one or more embodiments, S104, fusing the updated first, second and third text feature vectors with the fourth, fifth, sixth, seventh, eighth and ninth visual features to obtain six fused features, comprises the following specific steps:
concatenating the updated first text feature vector with the fourth visual feature to obtain a first fused feature;
concatenating the updated second text feature vector with the fifth visual feature to obtain a second fused feature;
concatenating the updated third text feature vector with the sixth visual feature to obtain a third fused feature;
concatenating the updated first text feature vector with the seventh visual feature to obtain a fourth fused feature;
concatenating the updated second text feature vector with the eighth visual feature to obtain a fifth fused feature;
and concatenating the updated third text feature vector with the ninth visual feature to obtain a sixth fused feature.
As one or more embodiments, S105, obtaining a target tracking result for each frame in the current video packet of the video to be tracked from the six fused features, comprises the following specific steps (a sketch of the final result fusion follows this list):
inputting the first fused feature into a first convolutional neural network (CNN), and inputting the output of the first convolutional neural network and the output of a fourth convolutional neural network into a first classification network to obtain a first classification result;
inputting the fourth fused feature into the fourth convolutional neural network, and inputting the output of the fourth convolutional neural network and the output of the first convolutional neural network into a first regression network to obtain a first regression result;
inputting the second fused feature into a second convolutional neural network, and inputting the output of the second convolutional neural network and the output of a fifth convolutional neural network into a second classification network to obtain a second classification result;
inputting the fifth fused feature into the fifth convolutional neural network, and inputting the output of the fifth convolutional neural network and the output of the second convolutional neural network into a second regression network to obtain a second regression result;
inputting the third fused feature into a third convolutional neural network, and inputting the output of the third convolutional neural network and the output of a sixth convolutional neural network into a third classification network to obtain a third classification result;
inputting the sixth fused feature into the sixth convolutional neural network, and inputting the output of the sixth convolutional neural network and the output of the third convolutional neural network into a third regression network to obtain a third regression result;
fusing the first, second and third classification results to obtain a final classification result;
fusing the first, second and third regression results to obtain a final regression result;
and obtaining a target tracking result for each frame in the current video packet of the video to be tracked from the final classification result and the final regression result.
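The patent states only that the three classification results and the three regression results are fused; the sketch below assumes a SiamRPN++-style learnable weighted sum, which is one plausible realization rather than the patent's confirmed choice:

```python
import torch
import torch.nn as nn

class HeadFusion(nn.Module):
    """Fuses classification/regression maps from the three feature levels."""

    def __init__(self, num_levels=3):
        super().__init__()
        self.cls_weight = nn.Parameter(torch.ones(num_levels))
        self.reg_weight = nn.Parameter(torch.ones(num_levels))

    def forward(self, cls_maps, reg_maps):
        # cls_maps / reg_maps: lists of three tensors with identical shapes
        wc = torch.softmax(self.cls_weight, dim=0)
        wr = torch.softmax(self.reg_weight, dim=0)
        cls = sum(w * m for w, m in zip(wc, cls_maps))
        reg = sum(w * m for w, m in zip(wr, reg_maps))
        return cls, reg

fusion = HeadFusion()
cls_maps = [torch.randn(1, 10, 25, 25) for _ in range(3)]  # 2 * k anchor scores
reg_maps = [torch.randn(1, 20, 25, 25) for _ in range(3)]  # 4 * k anchor offsets
final_cls, final_reg = fusion(cls_maps, reg_maps)
```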
In the method provided by the present application, a text feature update module based on a Long Short-Term Memory (LSTM) network uses the deep feature of the initial text description as the initial hidden state and, at set frame intervals, takes the deep feature of the current frame as input to update the text feature held as the hidden state, so that the deep text feature is expected to change accordingly when the target moves or its appearance changes in the video. The updated deep text feature is then fused with the deep visual features over the next set number of frames, and the target in each frame is detected from the fused features with a SiamRPN-style method.
Past tracking algorithms typically perform detection or matching by randomly selecting positive and negative samples from the dataset during training. To update the deep text feature, however, the temporal order of the sequence must be taken into account. Therefore, the feature update module is trained with a serialized training method: each video is divided into the same number of segments, and the number of frames in each segment may differ.
The main contributions of the present application are as follows: a text feature update module is proposed to reduce the gap between the textual expression and visual information such as the position and appearance of the target; and a serialized training method is proposed to train the text feature update module so that it meets the expectation of updating the deep text feature.
Single-target tracking from a manually labeled target box is a long-standing challenge in the field of machine vision, and researchers have proposed many single-target tracking algorithms, typically algorithms based on correlation filters (CF) and algorithms based on recurrent neural networks (RNN). In recent years, twin (Siamese) structures based on matching networks have attracted increasing attention for their accuracy and efficiency; SiamFC, SiamRPN++ and SiamMask are representative twin-network-based algorithms.
Research on single-target tracking based on a text description has also received more and more attention in recent years, but most algorithms treat the text description as a global constraint on the single-target tracking task and ignore its limitations.
Given a piece of video and a text annotation related to the tracked target, the purpose of the present application is to track that target in the video. The main challenge in most scenarios is that the text annotation cannot accurately describe the position and appearance of the tracked target in different frames. To solve this problem, the present application proposes a tracking algorithm comprising two modules: a feature update module and a tracking module, whose details are described below.
Feature update module: the feature update module aims to reduce the limitation of the text description in the single-target tracking task, so that the updated deep text feature better reflects the state of the tracked target. The feature update module presented here accomplishes this task with a set of LSTM networks.
The feature update module contains three parallel LSTM units. First, the text is encoded into a 768-dimensional feature vector with the BERT (Bidirectional Encoder Representations from Transformers) method, and a fully connected network then projects the text feature vector to 512 dimensions. The text feature is used as the initial hidden state of each LSTM unit at the initial time, and at a particular time t the LSTM updates the hidden state as follows:
f_t = σ(ω_f [l_{t-1}, v_t] + b_f)
i_t = σ(ω_i [l_{t-1}, v_t] + b_i)
l_t = f_t ⊙ l_{t-1} + i_t ⊙ tanh(ω_l v_t + b_l)
where l_t and v_t denote, respectively, the hidden state of the LSTM initialized with the text feature and the visual feature input to the LSTM, and f_t and i_t denote, respectively, the forget gate and the input gate of the LSTM unit. The forget gate decides whether the value of the hidden state at the current time should be discarded, and the input gate decides whether the value of the deep visual feature input at the current time should be written. ω and b denote the trainable weights and bias parameters of the gate operations. σ and ⊙ denote the sigmoid activation function and the Hadamard (element-wise) product.
At time t, the LSTM takes the deep visual feature as input to process the hidden state l_{t-1}. By initializing the LSTM hidden state with the text feature and updating it through the input-gate and forget-gate operations, the deep text feature can change as the position and appearance of the tracked target change; the equations above are written out as code below.
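The gate equations above translated directly into code; the feature dimension, the parameter shapes and the helper name text_lstm_step are assumptions, and the element-wise product ⊙ is written as *:

```python
import torch

def text_lstm_step(l_prev, v_t, params):
    """One update step of the text-feature hidden state l, following the equations above.

    l_prev: (B, D) hidden state initialized from the text feature
    v_t:    (B, D) pooled visual feature of the current sampled frame
    params: weight matrices w_f, w_i of shape (D, 2D), w_l of shape (D, D), and biases of shape (D,)
    """
    x = torch.cat([l_prev, v_t], dim=1)                       # [l_{t-1}, v_t]
    f_t = torch.sigmoid(x @ params["w_f"].T + params["b_f"])  # forget gate
    i_t = torch.sigmoid(x @ params["w_i"].T + params["b_i"])  # input gate
    l_t = f_t * l_prev + i_t * torch.tanh(v_t @ params["w_l"].T + params["b_l"])
    return l_t

D = 512
params = {
    "w_f": torch.randn(D, 2 * D), "b_f": torch.zeros(D),
    "w_i": torch.randn(D, 2 * D), "b_i": torch.zeros(D),
    "w_l": torch.randn(D, D),     "b_l": torch.zeros(D),
}
l_t = text_lstm_step(torch.randn(1, D), torch.randn(1, D), params)
```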
The three parallel LSTM networks take the serialized deep visual features as input and update the hidden states initialized with the text features, so that the text features are updated and change with the position and appearance of the tracked target. This efficiently exploits the visual information of the video and enriches the deep text features.
In the twin-network structure, the deep features of the template image of the target and of the search-region image of the current frame are extracted with the ResNet-50 network; the visual features at three different depths are then globally pooled and input into the three parallel LSTM networks, and the deep text features updated by the LSTM networks change with the visual features.
A tracking module: the tracking module provided by the application finds a region with high similarity with a template image in a search region image by inputting the template image containing a target and the search region image, and the region is used as a result of a tracking algorithm. Different from the traditional twin network which carries out pre-cutting and filling work on the image of the search area, the method and the device do not cut the original image but fill the original image to the size of the standard input. In most scenarios, maintaining the size of the artwork may maintain the association between the location information of the target and the text label. Template images used in the training process are manually marked from a data set, and the template images are obtained by using a Visual grouping method in the testing stage.
As shown in Fig. 1, similarly to the update module, the deep visual features of the template image and the search-region image are extracted by the same ResNet-50 network, and the deep features of the template image and the search-region image are then fused with the updated text features. The updated text feature is fully connected into a 256-dimensional feature vector; the 1 × 1 × 256 one-dimensional feature vector is then stacked to 7 × 7 × 256 and 31 × 31 × 256 (7 × 7 matches the spatial size of the template-image features and 31 × 31 matches that of the search-region features), after which the text features and the visual features are concatenated for fusion (a sketch of this step follows). The fused features use visual information to further reduce the ambiguity of the language description and improve the target-awareness of the visual features. Next, the fused features are processed with a convolutional neural network (CNN). Finally, the fused features are input into the region proposal network of the twin-network structure to detect the tracked target. The outputs of the classification branch and the regression branch of the region proposal network are the foreground/background classification of the detection boxes and the regression of the target box. As in the conventional twin network, we use a two-class cross-entropy loss and a smooth L1 loss.
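A minimal sketch of the stacking-and-concatenation step: the 256-dimensional text vector is tiled to the spatial size of the template (7 × 7) and search-region (31 × 31) feature maps and concatenated along the channel dimension; the 256-channel visual maps and the helper name fuse_text_with_visual are assumptions:

```python
import torch

def fuse_text_with_visual(text_vec, visual_map):
    """Tile a (B, 256) text vector to (B, 256, H, W) and concatenate it with the visual map."""
    b, c = text_vec.shape
    h, w = visual_map.shape[-2:]
    tiled = text_vec.view(b, c, 1, 1).expand(b, c, h, w)
    return torch.cat([visual_map, tiled], dim=1)   # (B, C_vis + 256, H, W)

text_vec = torch.randn(1, 256)
template_feat = torch.randn(1, 256, 7, 7)     # template-image feature map
search_feat = torch.randn(1, 256, 31, 31)     # search-region feature map
fused_template = fuse_text_with_visual(text_vec, template_feat)  # (1, 512, 7, 7)
fused_search = fuse_text_with_visual(text_vec, search_feat)      # (1, 512, 31, 31)
```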
The classification loss is the two-class cross-entropy over the candidate regions:

L_cls = -(1/N) Σ_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]

where y_i denotes the foreground/background label of the i-th candidate region and p_i its predicted foreground probability. A_x, A_y, A_w, A_h denote the x- and y-coordinates of the center point, the width and the height of the candidate box, and T_x, T_y, T_w, T_h denote the coordinates, width and height of the ground-truth target box; the four-dimensional normalized distances are

δ[0] = (T_x - A_x) / A_w,  δ[1] = (T_y - A_y) / A_h,  δ[2] = ln(T_w / A_w),  δ[3] = ln(T_h / A_h).
The regression loss is a smooth L1 loss over the normalized distances:

smooth_L1(x, σ) = 0.5 σ² x²  if |x| < 1/σ²,  otherwise |x| - 0.5/σ²

L_reg = Σ_{j=0}^{3} smooth_L1(δ[j], σ)

The total loss is L_total = L_cls + λ L_reg, where λ is a hyperparameter balancing the classification and regression losses. A sketch of this loss is given below.
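A sketch of the loss above, assuming standard binary cross-entropy over anchor labels and PyTorch's default smooth-L1 parameterization; the masking of ignored anchors and the reductions are assumptions:

```python
import torch
import torch.nn.functional as F

def tracking_loss(cls_logits, cls_labels, reg_pred, reg_target, lam=1.0):
    """Total loss L_total = L_cls + lambda * L_reg over one batch of anchors.

    cls_logits: (N, 2) foreground/background scores per candidate box
    cls_labels: (N,)   1 = foreground, 0 = background, -1 = ignored
    reg_pred, reg_target: (N, 4) normalized offsets (dx, dy, dw, dh)
    """
    valid = cls_labels >= 0
    loss_cls = F.cross_entropy(cls_logits[valid], cls_labels[valid].long())
    pos = cls_labels == 1
    if pos.any():
        loss_reg = F.smooth_l1_loss(reg_pred[pos], reg_target[pos])
    else:
        loss_reg = reg_pred.new_zeros(())
    return loss_cls + lam * loss_reg

# Example with random anchors.
loss = tracking_loss(torch.randn(100, 2),
                     torch.randint(-1, 2, (100,)),
                     torch.randn(100, 4), torch.randn(100, 4))
```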
Application details: in the training process, the video is divided into 50 fragments and packets containing the natural language annotations, the size of a frame in each packet is adjusted to be 255 multiplied by 255, and the size is used for updating the module and the image of the search area of the twin network. Meanwhile, a template image including a tracking target is input to the twin network as a sample.
The text annotation is encoded into a feature vector by BERT and a fully connected network and used to initialize the hidden state of the LSTM network. The update module then updates the text feature held as the hidden state according to the deep visual features, so as to improve the target-awareness of the text feature over the search image sequence. The updated deep text feature is fused with the features of the template image and the search-region image, and the fused features are finally used by the twin network to predict the position of the tracked target.
The present application uses a modified ResNet-50 network pre-trained on the ImageNet dataset. The model is trained with a momentum optimizer with a weight decay of 1 × 10^-4 and momentum set to 0.9; the initial learning rate is 5 × 10^-3 and is reduced by 1 × 10^-4 after each training epoch, and the training batch size is 32. Each video is cut into 50 segments, that is, the deep text feature is updated 50 times per video. The model is trained for 5, 10, 15 and 20 epochs and then tested; these settings are expressed as code below.
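The training settings quoted above written as a hedged PyTorch configuration; SGD with momentum stands in for the "momentum optimizer", and the placeholder model and the simple per-epoch learning-rate step are assumptions:

```python
import torch

model = torch.nn.Linear(8, 8)  # placeholder for the full tracker (backbone, update module, heads)

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=5e-3,            # initial learning rate 5 x 10^-3
    momentum=0.9,
    weight_decay=1e-4,  # weight decay 1 x 10^-4
)

def adjust_lr(epoch, base_lr=5e-3, step=1e-4):
    """Reduce the learning rate by 1 x 10^-4 after each training epoch."""
    return max(base_lr - epoch * step, 1e-5)

for epoch in range(20):
    for group in optimizer.param_groups:
        group["lr"] = adjust_lr(epoch)
    # ... run one training epoch here with batch size 32 and 50 packets (text-feature updates) per video ...
```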
In the application stage, the template image is generated with a visual grounding method. Visual grounding predicts, from the text annotation, the box in the image that corresponds to the text content. Once a box for the tracked target is available, the template image can be cropped from the first frame of the video using that box. The visual grounding method is also used to recover the tracking result after the tracking algorithm loses the target.
Experimental results: the experimental results are presented and analyzed next. The experimental datasets and evaluation criteria are introduced first, together with some implementation details, followed by the comparison with conventional methods. The present application also analyzes the model under different settings and attempts to explain its behavior and working principle. The experiments were run on an Intel Xeon CPU E5-2687W v3 at 3.10 GHz and NVIDIA Tesla V100 GPUs.
The datasets used for the experiments are the LaSOT dataset and the Lingual OTB99 dataset, since every video in both datasets is annotated with text. The LaSOT dataset is a large benchmark for single-target tracking comprising 1400 video sequences, each with a natural language annotation and a target box in every frame; 1120 videos are used for training and 280 for testing. Because the main purpose of the LaSOT text annotations is to assist the tracking process and their description of the target is not precise enough, some of the text annotations were revised to reduce ambiguity. The Lingual OTB99 dataset is an extended version of the OTB100 dataset in which each video is annotated with a sentence; it contains 51 training videos and 48 test videos.
As with conventional tracking algorithms, precision and success rate are used as the evaluation criteria. Precision is the percentage of frames in which the distance between the centers of the predicted and ground-truth target boxes is below a given threshold, and the success rate is the percentage of frames in which the intersection-over-union of the predicted and ground-truth target boxes exceeds a given threshold; a sketch of both metrics follows.
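A sketch of the two evaluation criteria, assuming the conventional OTB/LaSOT definitions: precision from the center-location error and success from the intersection-over-union of the predicted and ground-truth boxes. The 20-pixel and 0.5 defaults are the customary thresholds, not values taken from the patent:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    yb = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def success_rate(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of frames whose IoU with the ground truth exceeds the threshold."""
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return float(np.mean([o > threshold for o in overlaps]))

def precision(pred_boxes, gt_boxes, threshold=20.0):
    """Fraction of frames whose center-location error is below the threshold (pixels)."""
    def center(b):
        return np.array([b[0] + b[2] / 2.0, b[1] + b[3] / 2.0])
    errors = [np.linalg.norm(center(p) - center(g)) for p, g in zip(pred_boxes, gt_boxes)]
    return float(np.mean([e < threshold for e in errors]))
```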
Compared with existing single-target tracking algorithms that use a text description, the proposed method is evaluated in two settings: tracking initialized only with the given text annotation, and tracking initialized with both the target box of the first frame and the text annotation. As shown in Table 1, under both initialization methods the proposed algorithm performs better than the conventional algorithms on both LaSOT and Lingual OTB99.
Partial tracking results are shown in Fig. 2. With the assistance of the tracking module, the model outperforms many algorithms initialized with the text annotation as well as those initialized with the target box of the first frame; it remains robust under occlusion, frame offset and similar disturbances, and it can recover and track the correct target after the target leaves the field of view or a wrong target has been tracked. The present application also compares the model with other tracking algorithms initialized only with the target box of the first frame. As shown in Table 1, the results of the model initialized only with the text annotation are competitive with those of algorithms initialized with the first-frame target box, and when both the target box and the text annotation are used for initialization the model performs better than the tracking algorithms initialized with the target box.
Conclusion: in single-target tracking, a concise text annotation of a whole video generally describes the state of the target in the first frame or its motion over the entire video rather than its exact position and appearance in each frame, because these properties of the target may change constantly from frame to frame. The present application proposes a new feature update module for single-target visual tracking based on a text description, uses an LSTM network to update the deep text feature, and fuses the updated deep text feature with the deep visual features to improve the performance of the single-target tracking algorithm. The experimental results show that the text description can help improve the single-target tracking algorithm and achieve better single-target tracking performance. FIGS. 3(a)-3(g) are schematic views of the effects of the first embodiment.
Table 1: Comparison of experimental results
Example two
This embodiment provides a single-target visual tracking apparatus based on a text description.
The single-target visual tracking apparatus based on a text description comprises:
a video packet dividing module configured to: obtain a template image of the target to be tracked; acquire the video to be tracked and a text description related to the target to be tracked; and evenly divide the video to be tracked into a plurality of video packets of a set number of frames;
a text feature extraction module configured to: extract first, second and third text features from the text description;
a visual feature extraction module configured to: extract first, second and third visual features from the n-th sampled frame of each video packet, where n is a positive integer whose upper limit is a specified value; update the first, second and third text features based on the first, second and third visual features of the n-th sampled frame of each video packet to obtain updated first, second and third text features; extract fourth, fifth and sixth visual features from the template image of the target to be tracked, the template image of the target to be tracked being the first frame image of the video to be tracked; and extract seventh, eighth and ninth visual features from the search-region image, the search-region image being all images in the current video packet;
a feature fusion module configured to: fuse the updated first, second and third text feature vectors with the fourth, fifth, sixth, seventh, eighth and ninth visual features to obtain six fused features;
an output module configured to: obtain a target tracking result for each frame in the current video packet of the video to be tracked from the six fused features.
It should be noted here that the video packet dividing module, the text feature extraction module, the visual feature extraction module, the feature fusion module and the output module correspond to steps S101 to S105 in the first embodiment, and the examples and application scenarios implemented by these modules are the same as those of the corresponding steps, but are not limited to the disclosure of the first embodiment. It should also be noted that the above modules, as part of a system, may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed.
EXAMPLE III
The present embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, a processor is connected with the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first embodiment.
It should be understood that in this embodiment, the processor may be a central processing unit CPU, and the processor may also be other general purpose processors, digital signal processors DSP, application specific integrated circuits ASIC, off-the-shelf programmable gate arrays FPGA or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and so on. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.
The method in the first embodiment may be implemented directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may reside in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware. To avoid repetition, details are not described here again.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Example four
The present embodiments also provide a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A single-target visual tracking method based on a text description, characterized by comprising:
obtaining a template image of the target to be tracked; acquiring the video to be tracked and a text description related to the target to be tracked; evenly dividing the video to be tracked into a plurality of video packets of a set number of frames;
extracting first, second and third text features from the text description;
extracting first, second and third visual features from the n-th sampled frame of each video packet, where n is a positive integer whose upper limit is a specified value; updating the first, second and third text features based on the first, second and third visual features of the n-th sampled frame of each video packet to obtain updated first, second and third text features; extracting fourth, fifth and sixth visual features from the template image of the target to be tracked, the template image of the target to be tracked being the first frame image of the video to be tracked; extracting seventh, eighth and ninth visual features from the search-region image, the search-region image being all images in the current video packet;
fusing the updated first, second and third text feature vectors with the fourth, fifth, sixth, seventh, eighth and ninth visual features to obtain six fused features;
and obtaining a target tracking result for each frame in the current video packet of the video to be tracked from the six fused features.
2. The single-target visual tracking method based on a text description according to claim 1, wherein extracting first, second and third text features from the text description comprises the following specific step:
extracting the first, second and third text features from the text description using the BERT method.
3. The single-target visual tracking method based on a text description according to claim 1, wherein extracting first, second and third visual features from the n-th sampled frame of each video packet, where n is a positive integer whose upper limit is a specified value, comprises the following specific steps:
performing visual feature extraction on the n-th sampled frame of each video packet with ResNet-50; convolutional layer Conv2_3 outputs the first visual feature; convolutional layer Conv3_4 outputs the second visual feature; convolutional layer Conv5_3 outputs the third visual feature.
4. The single-target visual tracking method based on a text description according to claim 1, wherein updating the first, second and third text features based on the first, second and third visual features of the n-th sampled frame of each video packet to obtain updated first, second and third text features comprises the following specific steps:
the first visual feature is processed by global average pooling to obtain a first sub-visual feature; the first text feature is used as the initial hidden state of a first LSTM model; the first sub-visual feature is input into the first LSTM model at a set time t, and the first LSTM model outputs the updated first text feature; in the first LSTM model, a forget gate decides whether the hidden state at the current moment should be discarded, and an input gate decides whether the value of the input visual feature should be written;
the second visual feature is processed by global average pooling to obtain a second sub-visual feature; the second text feature is used as the initial hidden state of a second LSTM model; the second sub-visual feature is input into the second LSTM model at the set time t, and the second LSTM model outputs the updated second text feature;
the third visual feature is processed by global average pooling to obtain a third sub-visual feature; the third text feature is used as the initial hidden state of a third LSTM model; and the third sub-visual feature is input into the third LSTM model at the set time t, and the third LSTM model outputs the updated third text feature.
5. The single-target visual tracking method based on a text description according to claim 1, wherein extracting fourth, fifth and sixth visual features from the template image of the target to be tracked (the template image of the target to be tracked being the first frame image of the video to be tracked) and extracting seventh, eighth and ninth visual features from the search-region image (the search-region image being all images in the current video packet) comprises the following specific steps:
performing visual feature extraction on the template image of the target to be tracked with ResNet-50; convolutional layer Conv2_3 of ResNet-50 outputs the fourth visual feature; convolutional layer Conv3_4 of ResNet-50 outputs the fifth visual feature; convolutional layer Conv5_3 of ResNet-50 outputs the sixth visual feature;
performing visual feature extraction on the search-region image of the target to be tracked with ResNet-50; convolutional layer Conv2_3 of ResNet-50 outputs the seventh visual feature; convolutional layer Conv3_4 of ResNet-50 outputs the eighth visual feature; convolutional layer Conv5_3 of ResNet-50 outputs the ninth visual feature.
6. The single-target visual tracking method based on a text description according to claim 1, wherein
fusing the updated first, second and third text feature vectors with the fourth, fifth, sixth, seventh, eighth and ninth visual features to obtain six fused features comprises the following specific steps:
concatenating the updated first text feature vector with the fourth visual feature to obtain a first fused feature; concatenating the updated second text feature vector with the fifth visual feature to obtain a second fused feature; concatenating the updated third text feature vector with the sixth visual feature to obtain a third fused feature; concatenating the updated first text feature vector with the seventh visual feature to obtain a fourth fused feature; concatenating the updated second text feature vector with the eighth visual feature to obtain a fifth fused feature; and concatenating the updated third text feature vector with the ninth visual feature to obtain a sixth fused feature.
7. The single-target visual tracking method based on a text description according to claim 1, wherein
obtaining a target tracking result for each frame in the current video packet of the video to be tracked from the six fused features comprises the following specific steps:
inputting the first fused feature into a first convolutional neural network (CNN), and inputting the output of the first convolutional neural network and the output of a fourth convolutional neural network into a first classification network to obtain a first classification result;
inputting the fourth fused feature into the fourth convolutional neural network, and inputting the output of the fourth convolutional neural network and the output of the first convolutional neural network into a first regression network to obtain a first regression result;
inputting the second fused feature into a second convolutional neural network, and inputting the output of the second convolutional neural network and the output of a fifth convolutional neural network into a second classification network to obtain a second classification result;
inputting the fifth fused feature into the fifth convolutional neural network, and inputting the output of the fifth convolutional neural network and the output of the second convolutional neural network into a second regression network to obtain a second regression result;
inputting the third fused feature into a third convolutional neural network, and inputting the output of the third convolutional neural network and the output of a sixth convolutional neural network into a third classification network to obtain a third classification result;
inputting the sixth fused feature into the sixth convolutional neural network, and inputting the output of the sixth convolutional neural network and the output of the third convolutional neural network into a third regression network to obtain a third regression result;
fusing the first, second and third classification results to obtain a final classification result;
fusing the first, second and third regression results to obtain a final regression result;
and obtaining a target tracking result for each frame in the current video packet of the video to be tracked from the final classification result and the final regression result.
8. A single-target visual tracking apparatus based on a text description, characterized by comprising:
a video packet dividing module configured to: obtain a template image of the target to be tracked; acquire the video to be tracked and a text description related to the target to be tracked; and evenly divide the video to be tracked into a plurality of video packets of a set number of frames;
a text feature extraction module configured to: extract first, second and third text features from the text description;
a visual feature extraction module configured to: extract first, second and third visual features from the n-th sampled frame of each video packet, where n is a positive integer whose upper limit is a specified value; update the first, second and third text features based on the first, second and third visual features of the n-th sampled frame of each video packet to obtain updated first, second and third text features; extract fourth, fifth and sixth visual features from the template image of the target to be tracked, the template image of the target to be tracked being the first frame image of the video to be tracked; and extract seventh, eighth and ninth visual features from the search-region image, the search-region image being all images in the current video packet;
a feature fusion module configured to: fuse the updated first, second and third text feature vectors with the fourth, fifth, sixth, seventh, eighth and ninth visual features to obtain six fused features;
an output module configured to: obtain a target tracking result for each frame in the current video packet of the video to be tracked from the six fused features.
9. An electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of any of the preceding claims 1-7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 7.
CN202011642602.9A 2020-12-31 2020-12-31 Single target tracking method, device, equipment and storage medium based on character description Active CN112734803B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011642602.9A CN112734803B (en) 2020-12-31 2020-12-31 Single target tracking method, device, equipment and storage medium based on character description

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011642602.9A CN112734803B (en) 2020-12-31 2020-12-31 Single target tracking method, device, equipment and storage medium based on character description

Publications (2)

Publication Number Publication Date
CN112734803A true CN112734803A (en) 2021-04-30
CN112734803B CN112734803B (en) 2023-03-24

Family

ID=75609164

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011642602.9A Active CN112734803B (en) 2020-12-31 2020-12-31 Single target tracking method, device, equipment and storage medium based on character description

Country Status (1)

Country Link
CN (1) CN112734803B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020211624A1 (en) * 2019-04-18 2020-10-22 腾讯科技(深圳)有限公司 Object tracking method, tracking processing method, corresponding apparatus and electronic device
CN110569723A (en) * 2019-08-02 2019-12-13 西安工业大学 Target tracking method combining feature fusion and model updating
CN110781951A (en) * 2019-10-23 2020-02-11 中国科学院自动化研究所 Visual tracking method based on thalamus dynamic allocation and based on multi-visual cortex information fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHENYANG LI ET AL.: "Tracking by Natural Language Specification", 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298142A (en) * 2021-05-24 2021-08-24 南京邮电大学 Target tracking method based on deep space-time twin network
CN113298142B (en) * 2021-05-24 2023-11-17 南京邮电大学 Target tracking method based on depth space-time twin network
CN114241586A (en) * 2022-02-21 2022-03-25 飞狐信息技术(天津)有限公司 Face detection method and device, storage medium and electronic equipment
CN115496975A (en) * 2022-08-29 2022-12-20 锋睿领创(珠海)科技有限公司 Auxiliary weighted data fusion method, device, equipment and storage medium
CN115496975B (en) * 2022-08-29 2023-08-18 锋睿领创(珠海)科技有限公司 Auxiliary weighted data fusion method, device, equipment and storage medium
CN116091551A (en) * 2023-03-14 2023-05-09 中南大学 Target retrieval tracking method and system based on multi-mode fusion
CN116091551B (en) * 2023-03-14 2023-06-20 中南大学 Target retrieval tracking method and system based on multi-mode fusion

Also Published As

Publication number Publication date
CN112734803B (en) 2023-03-24

Similar Documents

Publication Publication Date Title
CN112734803B (en) Single target tracking method, device, equipment and storage medium based on character description
CN110738207B (en) Character detection method for fusing character area edge information in character image
Fan et al. Multi-level contextual rnns with attention model for scene labeling
CN110866140A (en) Image feature extraction model training method, image searching method and computer equipment
CN110334589B (en) High-time-sequence 3D neural network action identification method based on hole convolution
CN106845430A (en) Pedestrian detection and tracking based on acceleration region convolutional neural networks
CN109271539B (en) Image automatic labeling method and device based on deep learning
CN111160350B (en) Portrait segmentation method, model training method, device, medium and electronic equipment
CN111709311A (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
Liu et al. Visual attention in deep learning: a review
CN112766170B (en) Self-adaptive segmentation detection method and device based on cluster unmanned aerial vehicle image
CN113361645B (en) Target detection model construction method and system based on meta learning and knowledge memory
CN110390294A (en) Target tracking method based on bidirectional long-short term memory neural network
Viraktamath et al. Comparison of YOLOv3 and SSD algorithms
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN114743130A (en) Multi-target pedestrian tracking method and system
Li A deep learning-based text detection and recognition approach for natural scenes
Liu Real-Time Object Detection for Autonomous Driving Based on Deep Learning
Rakowski et al. Hand shape recognition using very deep convolutional neural networks
CN111767919A (en) Target detection method for multi-layer bidirectional feature extraction and fusion
CN113408282B (en) Method, device, equipment and storage medium for topic model training and topic prediction
CN114898290A (en) Real-time detection method and system for marine ship
Jokela Person counter using real-time object detection and a small neural network
Visalatchi et al. Intelligent Vision with TensorFlow using Neural Network Algorithms
Pham et al. Vietnamese Scene Text Detection and Recognition using Deep Learning: An Empirical Study

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant