CN116596966A - Segmentation and tracking method based on attention and feature fusion

Segmentation and tracking method based on attention and feature fusion

Info

Publication number
CN116596966A
CN116596966A (application CN202310519848.4A)
Authority
CN
China
Prior art keywords
feature
attention
target
network
tracking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310519848.4A
Other languages
Chinese (zh)
Inventor
黄丹丹
胡力洲
刘智
陈广秋
杨明婷
于斯宇
郝文豪
王一雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changchun University of Science and Technology
Original Assignee
Changchun University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changchun University of Science and Technology filed Critical Changchun University of Science and Technology
Priority to CN202310519848.4A priority Critical patent/CN116596966A/en
Publication of CN116596966A publication Critical patent/CN116596966A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/776Validation; Performance evaluation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention belongs to the technical field of image information processing, and in particular relates to a segmentation and tracking method based on attention and feature fusion, comprising the following steps: step 1, constructing a SiamMask-based basic segmentation and tracking framework; step 2, adding a mixed attention module; step 3, adding feature fusion; step 4, model training, in which sample pictures are input into a twin network and the training process is performed offline; and step 5, model testing. By adding the mixed attention module, the invention strengthens the feature learning ability of the network through interdependent channel and spatial features and generates a more discriminative object representation, thereby greatly improving model performance.

Description

Segmentation and tracking method based on attention and feature fusion
Technical Field
The invention relates to the technical field of image information processing, in particular to a segmentation and tracking method based on attention and feature fusion.
Background
Single-target tracking is one of the hot spots of computer vision research and is widely applied: camera tracking focus, automatic target tracking by unmanned aerial vehicles, and the tracking of specific objects such as human bodies, vehicles in traffic monitoring systems, faces, and gestures in intelligent interaction systems all rely on single-target tracking technology. With the continued, in-depth work of researchers, visual target tracking has made breakthrough progress over the past decade or so, and visual tracking algorithms are no longer limited to traditional machine learning methods; in recent years they have combined deep learning, correlation filters and related methods to obtain robust and accurate results. Following the great success of deep learning in speech recognition, image classification, target detection and other fields, deep learning frameworks are increasingly applied to the target tracking task, and target tracking technology has found wider application in many areas of society. Especially in the current social environment, the demand from all sectors of society for high-technology tracking methods keeps rising, which makes target tracking technology all the more important.
Target tracking establishes the positional relationship of the object to be tracked across a continuous video sequence to obtain its complete motion trajectory: given the target's coordinate position in the first frame, the exact position of the target in the next frame is computed. During its motion the target may exhibit image changes such as pose or shape changes, scale changes, background occlusion, or changes in illumination, and research on target tracking algorithms has developed around handling these changes and specific applications. The focus of the target tracking problem is tracking accuracy and tracking speed; with the continuous improvement of computer hardware performance, target tracking can accomplish more and more, so specific applications based on single-target tracking have begun to appear in many fields.
With the development of object tracking, the requirements on the accuracy of the object's shape and boundary are getting higher and higher. An object tracking method combined with image segmentation can therefore unify the two problems in one framework and locate the tracked object more accurately. The segmentation and tracking of a video object appear to be independent problems, but in reality they are complementary and inseparable; that is, the solution of one problem usually solves the other directly or indirectly. Many studies have paid attention to joint object segmentation and tracking, where the two tasks can overcome their respective difficulties and improve each other's performance. Therefore, a segmentation and tracking method based on attention and feature fusion is proposed to solve these problems.
Disclosure of Invention
(I) Technical problems to be solved
Aiming at the defects of the prior art, the invention provides a segmentation and tracking method based on attention and feature fusion, which solves the problems in the background art.
(II) technical scheme
To achieve the above purpose, the invention adopts the following technical scheme:
a segmentation and tracking method based on attention and feature fusion comprises the following steps:
step 1, constructing a SiamMask-based basic segmentation and tracking framework;
step 2, adding a mixed attention module;
step 3, adding feature fusion;
step 4, model training, namely inputting sample pictures into a twin network for training, wherein the training process is performed offline;
and 5, model testing.
Further, step 1 comprises a twin sub-network, the feature extraction network ResNet-50, a depth-wise separable convolution cross-correlation layer and three output branches; the twin sub-network layer is used to measure the similarity of two inputs, the ResNet-50 layer and the depth-wise separable convolution cross-correlation layer are used to generate multiple candidate-window response features, and the three output branches are the mask branch, the bounding box branch and the score branch.
Further, the mixed attention module added in step 2 includes channel attention and spatial attention. The two attention modules are connected in parallel: the input passes through the channel attention module and the spatial attention module respectively, and the two results are added to obtain a more accurately calibrated feature map. Channel attention: [C, H, W] is reduced to [C, 1, 1] by global average pooling, two 1×1 convolutions then process the information to obtain a C-dimensional vector, a sigmoid function normalizes it into the corresponding mask, and channel-wise multiplication yields the information-calibrated feature map. Spatial attention: the feature map is convolved directly with a 1×1 convolution so that [C, H, W] becomes a [1, H, W] feature, a sigmoid activation then gives the spatial attention map, and finally this map is applied directly to the original feature map to complete the spatial information calibration.
Further, in step 3, to strengthen the CNN feature expression of the backbone network, the model is designed in three parts: a bottom-up structure, a top-down structure, and lateral connections. For the feature extraction network ResNet-50 only the first four stages are used, and a dilated convolution kernel with dilation rate 2 is used in the first convolution layer of the fourth stage. The feature fusion strategy is improved as follows: the outputs of the ResNet convolution blocks conv2, conv3 and conv4 are defined as {C2, C3, C4}; layers whose output feature maps keep the same size are grouped into one stage, and the feature output by the last layer of each stage is extracted, the outputs being 1/4, 1/8 and 1/16 of the original image after downsampling, respectively. They are then fused, through lateral connections, with the feature maps generated by top-down upsampling, and finally processed by 3×3 convolutions to obtain the outputs P2, P3 and P4.
Further, the step 4 model training includes the steps of:
S1, sample pictures are input into the twin network for training, and the training process is performed offline. Training uses four data sets: COCO, ImageNet-DET2015, ImageNet-VID2015, and YouTube-VOS.
S2, the twin network is used to measure the similarity of the input samples: a sample comprises a target image and a search image, where the target image is the 127×127 image to be tracked and the search image is the 255×255 image in which tracking of the target is performed. The twin neural network has two input branches, one taking the target template Z as input and the other taking the search region X as input. The two inputs enter two weight-shared neural networks, which map them into a new space to form representations of the inputs in that space; the similarity of the two inputs is evaluated by computing the loss.
S3, the feature extraction network: the target template and the search area are both fed into the same feature extraction network, ResNet. Only the first four stages of ResNet are used; a dilated convolution kernel with dilation rate 2 is used in the first convolution layer of the fourth stage, the 3×3 convolution kernel is changed into a 7×7 convolution kernel, and the stride is set to 1. The feature extraction network finally produces two feature maps, namely the target image feature map and the search area feature map;
S4, feature map generation: to generate the target feature map, the target is first selected and preprocessed, and then cropped with the crop box centered on it. The center of the preprocessed picture is taken directly as the center, the coordinates of a crop box of size 127 are obtained, the resulting crop box is then subjected to operations such as random scaling, random translation by a few pixels and random flipping, and finally an affine transformation yields a picture with the target at its center;
The search area feature map is generated similarly to the target feature map, except that, besides cropping the search region from the original image, a mask is cropped from the mask map, and data enhancement operations such as random blurring and flipping then need to be applied to the picture and the mask in sync. A depth-wise separable convolution operation is then performed; the resulting response keeps the number of channels unchanged at 256, the response in the middle is called RoW, i.e. the response of the candidate window, and three branches are then split off on the basis of RoW to perform segmentation, regression and classification respectively;
S5, pre-training: the network backbone is pre-trained on ImageNet-1k. SGD is used with a warm-up phase: the learning rate increases from 0.001 to 0.005 over the first five epochs and then decreases to 0.0003 over the next 15 epochs.
Further, the step 5 model test comprises the following steps:
S1, the trained model is tested on the VOT data set to obtain the tracking effect.
S2, the tracking effect is visualized on a new video sequence: a rectangular box containing the center position and size of the target to be tracked is marked manually in the first frame, the tracking algorithm is then required to follow this box in the subsequent frames, and the algorithm computes the position offset and size change of the target in each subsequent frame.
(III) beneficial effects
Compared with the prior art, the invention provides a segmentation and tracking method based on attention and feature fusion, which has the following beneficial effects:
according to the invention, by adding the mixed attention template, the feature learning capability of the network is enhanced through the interdependent channel features and the spatial features, and the object representation with more discrimination is generated, so that the model performance is greatly improved.
According to the invention, the problem of low model accuracy in tracking is solved by introducing feature fusion, so that the finally output features can better represent the information of each dimension of the input picture by utilizing the feature fusion, and the adaptability of the model is effectively improved.
The invention introduces multiple mixed attention to learn the characteristic weight according to loss through the network, so that the effective characteristic graph has high weight, and the ineffective or small-effect characteristic graph has small weight, so that the training model achieves better result.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a schematic diagram of a SiamMask network framework of the present invention;
FIG. 3 is a schematic diagram of a mixed attention architecture of the present invention;
FIG. 4 is a schematic diagram of a Resnet+ feature fusion architecture of the present invention;
FIG. 5 is a schematic diagram of a training architecture of the present invention;
FIG. 6 is a graph of the evaluation of VOT2018 test data set according to the present invention;
FIG. 7 is a schematic diagram of the results of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples
As shown in fig. 1-7, a segmentation and tracking method based on attention and feature fusion according to an embodiment of the present invention includes the following steps:
as shown in fig. 2, step one, constructing a basic segmentation and tracking framework based on a siemmask; the core idea of Siammask target tracking is that an object image to be tracked is firstly framed through an initial frame and used as a retrieval basis of a subsequent frame; secondly, inputting the target and the search into a twin network simultaneously, outputting two feature images, and performing cross correlation on the two feature images to obtain a RoW feature image; next, a simple two-layer 1*1 convolution kernel convolution operation is input from the output feature map to obtain an output, and finally, the input of the mask of the target is generated, the input of the SiamMask refers to the twin network structure, and there are two inputs, namely a target image and a search image, wherein the target image is an image to be tracked, for example, 127×127×3, 127×127 represents the size of a template image, and 3 represents the number of channels. The search image refers to an image that is performed to track the target, for example 255×255×3, 255×255 representing the size of the search image, and 3 representing the number of channels;
the backbone network of the extracted features is a CNN structure shared by two branches, one branch target template Z is taken as input, the other branch search area X is taken as input, and the two branch target templates Z and the other branch search area X are simultaneously input into the twin network, f θ ResNet, f for shared weights θ The first four stages of RseNet are used, a cavity convolution kernel with the expansion rate of 2 is used in the first layer convolution of the fourth stage, namely, the 3*3 convolution kernel is changed into 7*7, and the stage step distance is set to be 1, so that the fourth stage does not carry out resolution reduction on the feature map, finally, two feature maps are obtained through a backbone network, namely, a target image feature map and a search area feature map, F1 and F2 are generated feature maps, and the feature map F1 is smaller than the feature map F2 because the Z size is smaller than the X;
where d is the operation of the depth level, that is, the cross correlation operation of the depth separable convolution, to obtain a plurality of candidate window response features, that is, performing a correlation calculation on the two feature maps to obtain a candidate frame response feature map with a constant channel number, for example, 17×17×256, and the response in the middle of the candidate window response feature map is called RoW, that is, the response of the candidate window, for example, 1×1×256;
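For illustration, a minimal PyTorch sketch of this depth-wise cross-correlation is given below; the function name `depthwise_xcorr` and the tensor shapes are illustrative assumptions, not the reference implementation of the invention.

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(search_feat, template_feat):
    """Depth-wise cross-correlation: the template feature map acts as a
    per-channel kernel slid over the search feature map, so the channel
    count (e.g. 256) is preserved in the response."""
    b, c, h, w = search_feat.shape
    # fold batch into channels and use a grouped convolution so that each
    # channel is correlated independently
    x = search_feat.reshape(1, b * c, h, w)
    kernel = template_feat.reshape(b * c, 1, *template_feat.shape[2:])
    out = F.conv2d(x, kernel, groups=b * c)
    return out.reshape(b, c, out.shape[2], out.shape[3])

# illustrative shapes: 15x15 template features, 31x31 search features
z = torch.randn(1, 256, 15, 15)   # target template features
x = torch.randn(1, 256, 31, 31)   # search region features
row = depthwise_xcorr(x, z)       # -> (1, 256, 17, 17) candidate-window responses
print(row.shape)
```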
Three branches are then split off on the basis of RoW, performing segmentation, regression and classification respectively;
Mask branch: in the segmentation task the h_θ network consists of two 1×1 convolution layers, one with 256 channels and one with 63×63 channels. The 256-channel RoW is extracted from the cross-correlation result and passed through two fully connected layers to obtain a (63×63)-channel RoW, which is finally expanded into a 63×63 response map. The classification of each pixel uses the information in the whole RoW, which eliminates the influence of objects similar to the target, and low-level and high-level features are fused at the same time to generate a more accurate mask;
Box branch: the framework of the whole algorithm inputs the current frame and the previous frame into the network and outputs the position of the bounding box in the current frame. After the respective regions are cropped, the current frame and the previous frame are fed into the network for feature extraction; the features are concatenated and input into a fully connected layer so that the features of the target can be compared with those of the current frame to find where the target has moved. The fully connected layer learns a complex feature-comparison function and outputs the relative motion of the target; its output is then connected to a layer of 4 nodes representing the coordinates of two corners of the bounding box, so as to output the position of the target. RoW passes through a convolution layer to obtain a response map of 4K channels, i.e. the respective x, y, w and h offsets of the K anchor boxes;
Score branch: the features of X and Z are extracted and a depth-wise cross-correlation operation is performed on them to obtain a confidence map score, with the following formula:
score=φ(z)*φ(x)+b
where z is the template image, x is the search image, φ is the transformation function, * denotes the cross-correlation operation, and b is a learnable scalar with an initial value of 0;
During training, the loss function for the responses of all candidate windows is the following binary logistic regression loss:
L_mask(θ, φ) = Σ_n ((1 + y_n) / (2wh)) · Σ_ij log(1 + exp(−c_n^ij · m_n^ij))
where y_n ∈ {±1} is the ground-truth label of the pixel-wise mask of size w×h, c_n^ij ∈ {±1} is the label of pixel (i, j) of that mask, and m_n^ij is the prediction for each pixel of the response of the n-th candidate window;
the weighted loss function formula of the whole system is as follows:
L_3B = λ_1 · L_mask + λ_2 · L_score + λ_3 · L_box
The mask branch follows DeepMask: the 256-channel RoW (shown in blue in the figure) is extracted from the cross-correlation result and passed through two fully connected layers to obtain a (63×63)-channel RoW; each column is expanded into a 63×63 response map that represents the score of the degree of coincidence or similarity between the target template and a certain position of the search area, and a mask map is then obtained by upsampling to expand the resolution. The box branch passes RoW through a convolution layer to obtain a response map of 4K channels, i.e. the respective x, y, w and h offsets of the K anchor boxes, while the score branch gives a response map of 2K channels, i.e. the foreground and background confidences of the K anchor boxes. The two branches use the smooth L1 loss and the cross-entropy loss respectively;
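A hedged sketch of how the weighted three-branch loss L_3B could be assembled is shown below, assuming a binary logistic (BCE-with-logits) mask loss, a cross-entropy score loss and a smooth L1 box loss as stated above; the tensor layout and the λ values are assumptions, not values fixed by the invention.

```python
import torch
import torch.nn.functional as F

def weighted_three_branch_loss(mask_logits, mask_gt, row_labels,
                               score_logits, score_gt, box_pred, box_gt,
                               lambdas=(32.0, 1.0, 1.0)):  # weights are illustrative
    """L_3B = l1*L_mask + l2*L_score + l3*L_box.
    Assumed shapes: mask_logits/mask_gt (N, 63*63), row_labels (N,),
    score_logits (N, 2), score_gt (N,), box_pred/box_gt (N, 4)."""
    pos = row_labels > 0  # mask and box losses only on positive RoWs (y_n = +1)
    l_mask = (F.binary_cross_entropy_with_logits(mask_logits[pos], mask_gt[pos])
              if pos.any() else mask_logits.sum() * 0)
    l_score = F.cross_entropy(score_logits, score_gt)          # fg/bg classification
    l_box = (F.smooth_l1_loss(box_pred[pos], box_gt[pos])      # (x, y, w, h) offsets
             if pos.any() else box_pred.sum() * 0)
    l1, l2, l3 = lambdas
    return l1 * l_mask + l2 * l_score + l3 * l_box
```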
As shown in fig. 3, step two: adding a mixed attention module. The channel attention module and the spatial attention module are connected in parallel; the input passes through the channel attention module and the spatial attention module respectively, and the two outputs are added to obtain a more accurately calibrated feature map;
Channel attention first extracts image features through a convolution layer, then compresses the feature matrix into a feature vector, trains a fully connected layer to obtain the relationship between features and classes, and finally computes the probability of each class for the target. The input feature map is U = [u_1, u_2, ..., u_C], where each channel u_i ∈ R^(H×W). After the global pooling layer, U yields a vector z ∈ R^(1×1×C); for each position k the formula is:
z_k = (1 / (H × W)) · Σ_{i=1..H} Σ_{j=1..W} u_k(i, j)
The resulting vector is passed through two fully connected layers to obtain the importance of each channel, with a ReLU in between, and is finally normalized to the range 0 to 1 by a sigmoid layer to give σ(z'). The whole process is:
U_1 = [σ(z'_1)·u_1, σ(z'_2)·u_2, σ(z'_3)·u_3, ..., σ(z'_n)·u_n]
where z'_i is the importance of the i-th channel u_i.
the spatial attention is first to convolve the feature map directly to make the feature map from [ C, H, W ]]Becomes [1, H, W ]]Is then mapped to [0, 1] using a sigmoid function]The space attention graph is directly added into the original feature graph to complete the information calibration of the space, and the input feature graph is U= [ U ] 1,1 ,u 1,2 ,...,u i ,j ,...,u H,W ]H, W are the dimensions of the feature map, respectively, (i, j) is the spatial position of the feature map, the spatial extrusion operation is realized by convolution, where the convolution is a convolution of 1×1, and the number of output channels is 1, and the formula is as follows:
q=W sq ★U
wherein q is a characteristic diagram with the channel number of 1, and the characteristic diagram is normalized to [0-1] by sigmod, and the formula is as follows:
wherein the method comprises the steps ofRepresenting the importance of the spatial coordinates (i, j) in the feature map,
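The parallel channel/spatial recalibration described in this step can be sketched roughly as follows in PyTorch; the class name, the reduction ratio and the exact layer ordering are assumptions consistent with the text, not the exact module of the invention.

```python
import torch
import torch.nn as nn

class MixedAttention(nn.Module):
    """Parallel channel and spatial attention; the two recalibrated maps
    are summed, as described in step two."""
    def __init__(self, channels, reduction=16):  # reduction ratio is an assumption
        super().__init__()
        # channel attention: [C,H,W] -> global average pool -> two 1x1 convs -> sigmoid
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # spatial attention: [C,H,W] -> 1x1 conv -> [1,H,W] -> sigmoid
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, u):
        u_channel = u * self.channel(u)   # channel-wise recalibration
        u_spatial = u * self.spatial(u)   # spatial recalibration
        return u_channel + u_spatial      # parallel branches are added

x = torch.randn(1, 256, 17, 17)
print(MixedAttention(256)(x).shape)   # torch.Size([1, 256, 17, 17])
```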
As shown in fig. 4, step three: adding feature fusion. Feature fusion does not change the original backbone network; the feature maps at each layer of the original backbone are taken out and then operated on. The model is designed in three parts: a bottom-up structure, a top-down structure, and lateral connections. Bottom-up: as the network in the left half deepens, the size of the feature maps keeps shrinking while the semantic information becomes richer, and the last feature map of each stage forms the feature pyramid. Top-down: the feature maps are continuously enlarged by upsampling, so that the low-level features also contain rich semantic information. Lateral connection: the upsampling result is fused with the feature map of the same size generated bottom-up; that is, the feature map from the left undergoes a 1×1 convolution, is added to the feature map from above, and then passes through a 3×3 convolution to obtain the feature output of that layer;
When the input is [320×320×64], the output of the Layer1 layer is [160×160×256] (the output and input sizes of Layer1 in ResNet are the same, since Layer1 in every ResNet does not downsample W and H and only adjusts the channels), the output of the Layer2 layer is [80×80×512], and the output of the Layer3 layer is [40×40×1024]. The Conv1 layer and the MaxPool layer before Layer1 each halve H and W, so the height and width at Layer1 are reduced 4 times; Layer1 does not change the height and width, and each of Layer2 to Layer3 halves the height and width. P2 to P4 are used subsequently to predict the bounding box, box regression and mask of objects;
Step four: sample pictures are input into the network for training, and the training process is performed offline. The sample pictures are 127×127 and the search pictures 255×255; the network backbone is pre-trained on ImageNet-1k. Training runs for 20 epochs, with the learning rate increasing from 0.001 to 0.005 over the first five epochs and decreasing to 0.0003 over the last 15 epochs;
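The stated learning-rate endpoints can be turned into a small schedule function as sketched below; only the endpoints (0.001 → 0.005 over the first five epochs, then down to 0.0003 over the remaining 15) come from the text, while the interpolation shapes are assumptions.

```python
import math

def learning_rate(epoch, warmup_epochs=5, total_epochs=20,
                  lr_start=1e-3, lr_peak=5e-3, lr_end=3e-4):
    """Warm up linearly to the peak, then decay log-linearly to the end value.
    The curve shapes are assumptions; only the endpoints come from the text."""
    if epoch < warmup_epochs:
        t = epoch / max(warmup_epochs - 1, 1)
        return lr_start + t * (lr_peak - lr_start)
    t = (epoch - warmup_epochs) / max(total_epochs - warmup_epochs - 1, 1)
    return math.exp(math.log(lr_peak) + t * (math.log(lr_end) - math.log(lr_peak)))

for e in range(20):
    print(e, round(learning_rate(e), 5))   # 0.001 ... 0.005 ... 0.0003
```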
Step five: model testing. The data set provided by the official VOT website is adopted and the training effect of the method is tested according to the VOT evaluation indices; as can be seen from fig. 6, the single-target tracking algorithm provided by the invention performs better than the original baseline algorithm.
As shown in fig. 2, in some embodiments, the basic network framework is a SiamMask-based basic tracking framework comprising a twin sub-network, the feature extraction network ResNet-50, a depth-wise separable convolution cross-correlation layer, and three output branches; the twin sub-network layer is used to measure the similarity of two inputs, the ResNet-50 layer and the depth-wise separable convolution cross-correlation layer are used to generate multiple candidate-window response features, and the three output branches are the mask branch, the box branch and the score branch.
Mask branch: used to predict the target mask; the classification of each pixel uses the information in the whole RoW, eliminating the influence of objects similar to the target, and low-level and high-level features are fused at the same time to generate a more accurate mask;
Box branch: gives the offset between the anchor box position and the actual position, used to correct the position from the previous frame. The framework of the whole algorithm inputs the current frame and the previous frame into the network and outputs the position of the bounding box in the current frame; after the respective regions are cropped, the two frames are fed into the network for feature extraction, the features are concatenated and input into a fully connected layer so that the features of the target can be compared with those of the current frame to find where the target has moved, the fully connected layer learns a complex feature-comparison function and outputs the relative motion of the target, its output is then connected to a layer of 4 nodes representing the coordinates of two corners of the bounding box so as to output the position of the target, and RoW passes through a convolution layer to obtain a response map of 4K channels, i.e. the respective x, y, w and h offsets of the K anchor boxes;
Score branch: used to distinguish whether an anchor box is foreground or background; the features of the template image and the search image are extracted and a depth-wise cross-correlation operation is performed on them to obtain the confidence map score.
As shown in fig. 3, in some embodiments, the SiamMask network introduces a new mixed attention mechanism. Mixed attention includes channel attention and spatial attention; the two attention modules are connected in parallel, and the input, after passing through the channel attention and spatial attention modules respectively, is added to obtain a more accurately calibrated feature map. Channel attention: [C, H, W] is reduced to [C, 1, 1] by global average pooling, two 1×1 convolutions then process the information to obtain a C-dimensional vector, a sigmoid function normalizes it into the corresponding mask, and channel-wise multiplication yields the information-calibrated feature map;
Spatial attention: the feature map is convolved directly with a 1×1 convolution so that [C, H, W] becomes [1, H, W]; a sigmoid activation then gives a spatial attention map normalized to [0, 1] with dimension 1×H×W, and feature recalibration multiplies it spatially with the original U to complete the spatial information calibration.
As shown in fig. 4, in some embodiments, the SiamMask network introduces feature fusion to strengthen the CNN feature expression of the backbone network, and the model is designed in three parts: a bottom-up structure, a top-down structure, and lateral connections. For the feature extraction network ResNet-50 only the first four stages are used, and a dilated convolution kernel with dilation rate 2 is used in the first convolution layer of the fourth stage. The feature fusion strategy is improved: the outputs of the ResNet convolution blocks conv2, conv3 and conv4 are defined as {C2, C3, C4}, layers whose output feature maps keep the same size are grouped into one stage, and the feature output by the last layer of each stage is extracted, the outputs being 1/4, 1/8 and 1/16 of the original image after downsampling, respectively; they are then fused, through lateral connections, with the feature maps generated by top-down upsampling, and finally processed by 3×3 convolutions to obtain the outputs P2, P3 and P4.
Specifically, C4 is processed with 256 1×1 convolution kernels to obtain a 64×256 result, denoted P4; P4 is upsampled by a factor of 2 and added to the result of applying 256 1×1 convolution kernels to C3 to obtain P3; P3 is upsampled by a factor of 2 and added to the result of applying 256 1×1 convolution kernels to C2 to obtain P2.
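A minimal sketch of this top-down fusion is given below, assuming 256-channel lateral 1×1 convolutions, factor-2 upsampling with addition, and 3×3 smoothing; the channel counts of C2-C4 follow ResNet-50 and the input shapes are illustrative, not prescribed by the invention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    """Top-down fusion of ResNet stages C2-C4 into P2-P4 (256 channels each)."""
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        # lateral 1x1 convolutions reduce each Ci to a common channel width
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        # 3x3 convolutions smooth the merged maps
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, c2, c3, c4):
        p4 = self.lateral[2](c4)
        p3 = self.lateral[1](c3) + F.interpolate(p4, scale_factor=2, mode='nearest')
        p2 = self.lateral[0](c2) + F.interpolate(p3, scale_factor=2, mode='nearest')
        return self.smooth[0](p2), self.smooth[1](p3), self.smooth[2](p4)

# illustrative 1/4, 1/8, 1/16 resolution stage outputs for a 256x256 input
c2, c3, c4 = (torch.randn(1, 256, 64, 64),
              torch.randn(1, 512, 32, 32),
              torch.randn(1, 1024, 16, 16))
p2, p3, p4 = FeatureFusion()(c2, c3, c4)
print(p2.shape, p3.shape, p4.shape)
```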
As shown in fig. 5, in some embodiments, the model training includes the steps of:
First, a sample is obtained and input into the network model for training. The sample comprises a target image and a search image, where the target image is the 127×127 image to be tracked and the search image is the 255×255 image in which the target is tracked. The backbone network is a CNN structure shared by the two branches: one branch takes the target template Z as input and the other takes the search area X as input, and both are fed into the same feature extraction network, ResNet. Only the first four stages of ResNet are used, a dilated convolution kernel with dilation rate 2 is used in the first convolution layer of the fourth stage, the 3×3 convolution kernel is changed into a 7×7 convolution kernel, and the stride is set to 1. The feature extraction network finally produces two feature maps, namely the target image feature map and the search area feature map;
To generate the target feature map, the target is first selected and preprocessed, and then cropped with the crop box centered on it: the center of the preprocessed picture is taken directly as the center, the coordinates of a crop box of size 127 are obtained, the resulting crop box is subjected to operations such as random scaling, random translation by a few pixels and random flipping, and finally an affine transformation yields a picture with the target at its center;
The search area feature map is generated similarly to the target feature map, except that, besides cropping the search region from the original image, a mask is cropped from the mask map, and data enhancement operations such as random blurring and flipping are then applied to the picture and the mask in sync; a depth-wise separable convolution operation is then performed, the resulting response keeps the number of channels unchanged at 256 and is called RoW, i.e. the response of the candidate window, and three branches are then split off on the basis of RoW to perform segmentation, regression and classification respectively.
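A rough sketch of this crop-and-augment pipeline is given below, assuming OpenCV affine warps; the helper `make_training_pair` and the augmentation ranges are hypothetical and only illustrate cropping the 127×127 template and the 255×255 search patch with the mask augmented in sync.

```python
import cv2
import numpy as np

def make_training_pair(image, mask, center, rng=np.random):
    """Crop a 127x127 template and a 255x255 search patch around `center`,
    applying random scale / shift / flip jointly to the search patch and mask.
    Augmentation ranges are illustrative assumptions."""
    def crop(img, size, scale=1.0, shift=(0, 0), flags=cv2.INTER_LINEAR):
        cx, cy = center[0] + shift[0], center[1] + shift[1]
        # affine transform mapping the crop window onto a size x size patch
        a = scale
        m = np.array([[a, 0, size / 2 - a * cx],
                      [0, a, size / 2 - a * cy]], dtype=np.float32)
        return cv2.warpAffine(img, m, (size, size), flags=flags)

    template = crop(image, 127)

    scale = rng.uniform(0.95, 1.05)           # random scaling
    shift = rng.randint(-8, 9, size=2)        # random translation by a few pixels
    search = crop(image, 255, scale, shift)
    search_mask = crop(mask, 255, scale, shift, flags=cv2.INTER_NEAREST)

    if rng.rand() < 0.5:                      # random horizontal flip, kept in sync
        search, search_mask = search[:, ::-1].copy(), search_mask[:, ::-1].copy()
    return template, search, search_mask
```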
Of the three branches, segmentation is used to predict the target mask; classification is used to distinguish whether an anchor box is foreground or background; and regression gives the offset between the anchor box position and the actual position, used to correct the position from the previous frame.
The network backbone is pre-trained on ImageNet-1k. SGD is used with a warm-up phase: the learning rate increases from 0.001 to 0.005 over the first five epochs and then decreases to 0.0003 over the next 15 epochs.
As shown in fig. 6, in some embodiments, the model test includes the steps of: testing the tracking effect of the trained model in a new video sequence;
In single-object tracking, as shown in fig. 7, a rectangular box containing the center position and size of the object to be tracked is typically given in the first frame; this box is usually marked manually. The tracking algorithm is then required to follow this box in the subsequent frames, computing the position offset and size change of the object in each subsequent frame, and the test results on the VOT data set are shown on the video sequence to give a direct visual impression.
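The visualization loop can be sketched as follows; `tracker.init` and `tracker.track` are hypothetical interfaces standing in for the trained model, and the drawing details are assumptions.

```python
import cv2

def visualize_tracking(video_path, init_box, tracker):
    """Draw the predicted box and mask on every frame after a manual first-frame init.
    `tracker` is assumed to expose init(frame, box) and track(frame) -> (box, mask)."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    if not ok:
        return
    tracker.init(frame, init_box)             # init_box = (x, y, w, h), marked by hand
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        box, mask = tracker.track(frame)      # per-frame offset and size update
        x, y, w, h = map(int, box)
        frame[mask > 0.5] = (0, 255, 0)       # overlay the segmentation mask
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 0, 255), 2)
        cv2.imshow('tracking', frame)
        if cv2.waitKey(1) & 0xFF == 27:       # Esc to quit
            break
    cap.release()
    cv2.destroyAllWindows()
```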
Finally, it should be noted that the foregoing description is only a preferred embodiment of the present invention and the present invention is not limited thereto; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A segmentation and tracking method based on attention and feature fusion is characterized in that: the method comprises the following steps:
step 1, constructing a SiamMask-based basic segmentation and tracking framework;
step 2, adding a mixed attention module;
step 3, adding feature fusion;
step 4, model training, namely inputting sample pictures into a twin network for training, wherein the training process is performed offline;
and 5, model testing.
2. The method for segmenting and tracking based on attention and feature fusion according to claim 1, wherein: step 1 comprises a twin sub-network, the feature extraction network ResNet-50, a depth-wise separable convolution cross-correlation layer and three output branches; the twin sub-network layer is used to measure the similarity of two inputs, the ResNet-50 layer and the depth-wise separable convolution cross-correlation layer are used to generate multiple candidate-window response features, and the three output branches are the mask branch, the bounding box branch and the score branch.
3. The method for segmenting and tracking based on attention and feature fusion according to claim 1, wherein: the mixed attention module added in step 2 includes channel attention and spatial attention; the two attention modules are connected in parallel, and the input, after passing through the channel attention module and the spatial attention module respectively, is added to obtain a more accurately calibrated feature map; channel attention: [C, H, W] is reduced to [C, 1, 1] by global average pooling, two 1×1 convolutions then process the information to obtain a C-dimensional vector, a sigmoid function normalizes it into the corresponding mask, and channel-wise multiplication yields the information-calibrated feature map; spatial attention: the feature map is convolved directly with a 1×1 convolution so that [C, H, W] becomes a [1, H, W] feature, a sigmoid activation then gives the spatial attention map, and finally this map is applied directly to the original feature map to complete the spatial information calibration.
4. The method for segmenting and tracking based on attention and feature fusion according to claim 1, wherein: in step 3, to strengthen the CNN feature expression of the backbone network, the model is designed in three parts: a bottom-up structure, a top-down structure, and lateral connections; for the feature extraction network ResNet-50 only the first four stages are used, and a dilated convolution kernel with dilation rate 2 is used in the first convolution layer of the fourth stage; the feature fusion strategy is improved: the outputs of the ResNet convolution blocks conv2, conv3 and conv4 are defined as {C2, C3, C4}, layers whose output feature maps keep the same size are grouped into one stage, and the feature output by the last layer of each stage is extracted, the outputs being 1/4, 1/8 and 1/16 of the original image after downsampling, respectively; they are then fused, through lateral connections, with the feature maps generated by top-down upsampling, and finally processed by 3×3 convolutions to obtain the outputs P2, P3 and P4.
5. The method for segmenting and tracking based on attention and feature fusion according to claim 1, wherein: the step 4 model training comprises the following steps:
S1, sample pictures are input into the twin network for training, and the training process is performed offline; training uses four data sets: COCO, ImageNet-DET2015, ImageNet-VID2015, and YouTube-VOS.
S2, the twin network is used to measure the similarity of the input samples: a sample comprises a target image and a search image, where the target image is the 127×127 image to be tracked and the search image is the 255×255 image in which tracking of the target is performed; the twin neural network has two input branches, one taking the target template Z as input and the other taking the search region X as input; the two inputs enter two weight-shared neural networks, which map them into a new space to form representations of the inputs in that space, and the similarity of the two inputs is evaluated by computing the loss.
S3, the feature extraction network: the target template and the search area are both fed into the same feature extraction network, ResNet; only the first four stages of ResNet are used, a dilated convolution kernel with dilation rate 2 is used in the first convolution layer of the fourth stage, the 3×3 convolution kernel is changed into a 7×7 convolution kernel, and the stride is set to 1; the feature extraction network finally produces two feature maps, namely the target image feature map and the search area feature map;
S4, feature map generation: to generate the target feature map, the target is first selected and preprocessed, and then cropped with the crop box centered on it; the center of the preprocessed picture is taken directly as the center, the coordinates of a crop box of size 127 are obtained, the resulting crop box is then subjected to operations such as random scaling, random translation by a few pixels and random flipping, and finally an affine transformation yields a picture with the target at its center;
the search area feature map is generated similarly to the target feature map, except that, besides cropping the search region from the original image, a mask is cropped from the mask map, and data enhancement operations such as random blurring and flipping are then applied to the picture and the mask in sync; a depth-wise separable convolution operation is then performed, the resulting response keeps the number of channels unchanged at 256, the response in the middle is called RoW, i.e. the response of the candidate window, and three branches are then split off on the basis of RoW to perform segmentation, regression and classification respectively;
S5, pre-training: the network backbone is pre-trained on ImageNet-1k; SGD is used with a warm-up phase, the learning rate increasing from 0.001 to 0.005 over the first five epochs and then decreasing to 0.0003 over the next 15 epochs.
6. The method for segmenting and tracking based on attention and feature fusion according to claim 1, wherein: the step 5 model test comprises the following steps:
S1, the trained model is tested on the VOT data set to obtain the tracking effect.
S2, the tracking effect is visualized on a new video sequence: a rectangular box containing the center position and size of the target to be tracked is marked manually in the first frame, the tracking algorithm is then required to follow this box in the subsequent frames, and the algorithm computes the position offset and size change of the target in the subsequent frames.
CN202310519848.4A 2023-05-10 2023-05-10 Segmentation and tracking method based on attention and feature fusion Pending CN116596966A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310519848.4A CN116596966A (en) 2023-05-10 2023-05-10 Segmentation and tracking method based on attention and feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310519848.4A CN116596966A (en) 2023-05-10 2023-05-10 Segmentation and tracking method based on attention and feature fusion

Publications (1)

Publication Number Publication Date
CN116596966A true CN116596966A (en) 2023-08-15

Family

ID=87600115

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310519848.4A Pending CN116596966A (en) 2023-05-10 2023-05-10 Segmentation and tracking method based on attention and feature fusion

Country Status (1)

Country Link
CN (1) CN116596966A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117392392A (en) * 2023-12-13 2024-01-12 河南科技学院 Rubber cutting line identification and generation method
CN117392392B (en) * 2023-12-13 2024-02-13 河南科技学院 Rubber cutting line identification and generation method

Similar Documents

Publication Publication Date Title
CN111210443B (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
Zhou et al. Split depth-wise separable graph-convolution network for road extraction in complex environments from high-resolution remote-sensing images
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN106547880B (en) Multi-dimensional geographic scene identification method fusing geographic area knowledge
CN111612807B (en) Small target image segmentation method based on scale and edge information
CN110910391B (en) Video object segmentation method for dual-module neural network structure
CN111612008B (en) Image segmentation method based on convolution network
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
CN113609896B (en) Object-level remote sensing change detection method and system based on dual-related attention
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN113344932A (en) Semi-supervised single-target video segmentation method
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
Xu et al. AMCA: Attention-guided multi-scale context aggregation network for remote sensing image change detection
CN116740527A (en) Remote sensing image change detection method combining U-shaped network and self-attention mechanism
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN116597267B (en) Image recognition method, device, computer equipment and storage medium
Chacon-Murguia et al. Moving object detection in video sequences based on a two-frame temporal information CNN
CN116311353A (en) Intensive pedestrian multi-target tracking method based on feature fusion, computer equipment and storage medium
CN115713546A (en) Lightweight target tracking algorithm for mobile terminal equipment
Zheng et al. DCU-NET: Self-supervised monocular depth estimation based on densely connected U-shaped convolutional neural networks
CN114708591A (en) Document image Chinese character detection method based on single character connection
CN113223006A (en) Lightweight target semantic segmentation method based on deep learning
Yian et al. Improved deeplabv3+ network segmentation method for urban road scenes
Chen et al. Building extraction from high-resolution remote sensing imagery based on multi-scale feature fusion and enhancement
CN116486203B (en) Single-target tracking method based on twin network and online template updating

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination