CN116596966A - Segmentation and tracking method based on attention and feature fusion

Segmentation and tracking method based on attention and feature fusion

Info

Publication number
CN116596966A
CN116596966A (application CN202310519848.4A)
Authority
CN
China
Prior art keywords
feature
attention
target
network
tracking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310519848.4A
Other languages
Chinese (zh)
Inventor
黄丹丹
胡力洲
刘智
陈广秋
杨明婷
于斯宇
郝文豪
王一雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changchun University of Science and Technology
Original Assignee
Changchun University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changchun University of Science and Technology filed Critical Changchun University of Science and Technology
Priority to CN202310519848.4A priority Critical patent/CN116596966A/en
Publication of CN116596966A publication Critical patent/CN116596966A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/776Validation; Performance evaluation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention belongs to the technical field of image information processing, and in particular relates to a segmentation and tracking method based on attention and feature fusion, comprising the following steps: step 1, constructing a SiamMask-based basic segmentation and tracking framework; step 2, adding a mixed attention module; step 3, adding feature fusion; step 4, model training, in which sample pictures are input into a twin network and the training process is performed offline; and step 5, model testing. By adding the mixed attention module, the invention strengthens the feature learning ability of the network through interdependent channel and spatial features and generates a more discriminative object representation, thereby greatly improving model performance.

Description

Segmentation and tracking method based on attention and feature fusion
Technical Field
The invention relates to the technical field of image information processing, in particular to a segmentation and tracking method based on attention and feature fusion.
Background
Single-target tracking is one of the hot spots of computer vision research and is widely applied: camera tracking focus, automatic target tracking by unmanned aerial vehicles, and the tracking of specific objects such as human bodies, vehicles in traffic monitoring systems, faces, and gestures in intelligent interaction systems all rely on single-target tracking technology. With the continued, in-depth work of researchers, visual target tracking has made breakthrough progress over the past decade or so, and visual tracking algorithms are no longer limited to traditional machine learning methods; in recent years they have combined deep learning, correlation filters and related methods to obtain robust and accurate results. Following the great success of deep learning in speech recognition, image classification, target detection and other fields, deep learning frameworks are increasingly applied to the target tracking task, and target tracking technology has found wider application in many areas of society. Especially in the current social environment, the demand from all sectors of society for high-technology tracking methods keeps rising, which makes target tracking technology all the more important.
Target tracking establishes the positional relationship of the object to be tracked across a continuous video sequence to obtain its complete motion trajectory: given the target's coordinate position in the first frame, the exact position of the target in the next frame is computed. During its motion the target may exhibit image changes such as pose or shape changes, scale changes, background occlusion, or changes in illumination, and research on target tracking algorithms has developed around handling these changes and specific applications. The focus of the target tracking problem is tracking accuracy and tracking speed; with the continuous improvement of computer hardware performance, target tracking can accomplish more and more, so specific applications based on single-target tracking have begun to appear in many fields.
With the development of object tracking, the requirements on the accuracy of the object's shape and boundary are getting higher and higher. An object tracking method combined with image segmentation can therefore unify the two problems in one framework and locate the tracked object more accurately. The segmentation and tracking of a video object appear to be independent problems, but in reality they are complementary and inseparable; that is, the solution of one problem usually solves the other directly or indirectly. Many studies have paid attention to joint object segmentation and tracking, where the two tasks can overcome their respective difficulties and improve each other's performance. Therefore, a segmentation and tracking method based on attention and feature fusion is proposed to solve these problems.
Disclosure of Invention
(I) Technical problems to be solved
Aiming at the defects of the prior art, the invention provides a segmentation and tracking method based on attention and feature fusion, which solves the problems in the background art.
(II) technical scheme
To achieve the above purpose, the invention adopts the following technical scheme:
a segmentation and tracking method based on attention and feature fusion comprises the following steps:
step 1, constructing a SiamMask-based basic segmentation and tracking framework;
step 2, adding a mixed attention module;
step 3, adding feature fusion;
step 4, model training, namely inputting sample pictures into a twin network for training, wherein the training process is performed offline;
and 5, model testing.
Further, step 1 comprises a twin sub-network, the feature extraction network ResNet-50, a depth-wise separable convolution cross-correlation layer and three output branches; the twin sub-network layer is used to measure the similarity of two inputs, the ResNet-50 layer and the depth-wise separable convolution cross-correlation layer are used to generate multiple candidate-window response features, and the three output branches are the mask branch, the bounding box branch and the score branch.
Further, the mixed attention module added in step 2 includes channel attention and spatial attention. The two attention modules are connected in parallel: the input passes through the channel attention module and the spatial attention module respectively, and the two results are added to obtain a more accurately calibrated feature map. Channel attention: [C, H, W] is reduced to [C, 1, 1] by global average pooling, two 1×1 convolutions then process the information to obtain a C-dimensional vector, a sigmoid function normalizes it into the corresponding mask, and channel-wise multiplication yields the information-calibrated feature map. Spatial attention: the feature map is convolved directly with a 1×1 convolution so that [C, H, W] becomes a [1, H, W] feature, a sigmoid activation then gives the spatial attention map, and finally this map is applied directly to the original feature map to complete the spatial information calibration.
Further, in step 3, to strengthen the CNN feature expression of the backbone network, the model is designed in three parts: a bottom-up structure, a top-down structure, and lateral connections. For the feature extraction network ResNet-50 only the first four stages are used, and a dilated convolution kernel with dilation rate 2 is used in the first convolution layer of the fourth stage. The feature fusion strategy is improved as follows: the outputs of the ResNet convolution blocks conv2, conv3 and conv4 are defined as {C2, C3, C4}; layers whose output feature maps keep the same size are grouped into one stage, and the feature output by the last layer of each stage is extracted, the outputs being 1/4, 1/8 and 1/16 of the original image after downsampling, respectively. They are then fused, through lateral connections, with the feature maps generated by top-down upsampling, and finally processed by 3×3 convolutions to obtain the outputs P2, P3 and P4.
Further, the step 4 model training includes the steps of:
S1, sample pictures are input into the twin network for training, and the training process is performed offline. Training uses four data sets: COCO, ImageNet-DET2015, ImageNet-VID2015, and YouTube-VOS.
S2, the twin network is used to measure the similarity of the input samples: a sample comprises a target image and a search image, where the target image is the 127×127 image to be tracked and the search image is the 255×255 image in which tracking of the target is performed. The twin neural network has two input branches, one taking the target template Z as input and the other taking the search region X as input. The two inputs enter two weight-shared neural networks, which map them into a new space to form representations of the inputs in that space; the similarity of the two inputs is evaluated by computing the loss.
S3, the feature extraction network: the target template and the search area are both fed into the same feature extraction network, ResNet. Only the first four stages of ResNet are used; a dilated convolution kernel with dilation rate 2 is used in the first convolution layer of the fourth stage, the 3×3 convolution kernel is changed into a 7×7 convolution kernel, and the stride is set to 1. The feature extraction network finally produces two feature maps, namely the target image feature map and the search area feature map;
S4, feature map generation: to generate the target feature map, the target is first selected and preprocessed, and then cropped with the crop box centered on it. The center of the preprocessed picture is taken directly as the center, the coordinates of a crop box of size 127 are obtained, the resulting crop box is then subjected to operations such as random scaling, random translation by a few pixels and random flipping, and finally an affine transformation yields a picture with the target at its center;
The search area feature map is generated similarly to the target feature map, except that, besides cropping the search region from the original image, a mask is cropped from the mask map, and data enhancement operations such as random blurring and flipping then need to be applied to the picture and the mask in sync. A depth-wise separable convolution operation is then performed; the resulting response keeps the number of channels unchanged at 256, the response in the middle is called RoW, i.e. the response of the candidate window, and three branches are then split off on the basis of RoW to perform segmentation, regression and classification respectively;
S5, pre-training: the network backbone is pre-trained on ImageNet-1k. SGD is used with a warm-up phase: the learning rate increases from 0.001 to 0.005 over the first five epochs and then decreases to 0.0003 over the next 15 epochs.
Further, the step 5 model test comprises the following steps:
S1, the trained model is tested on the VOT data set to obtain the tracking effect.
S2, the tracking effect is visualized on a new video sequence: a rectangular box containing the center position and size of the target to be tracked is marked manually in the first frame, the tracking algorithm is then required to follow this box in the subsequent frames, and the algorithm computes the position offset and size change of the target in each subsequent frame.
(III) beneficial effects
Compared with the prior art, the invention provides a segmentation and tracking method based on attention and feature fusion, which has the following beneficial effects:
according to the invention, by adding the mixed attention template, the feature learning capability of the network is enhanced through the interdependent channel features and the spatial features, and the object representation with more discrimination is generated, so that the model performance is greatly improved.
According to the invention, the problem of low model accuracy in tracking is solved by introducing feature fusion, so that the finally output features can better represent the information of each dimension of the input picture by utilizing the feature fusion, and the adaptability of the model is effectively improved.
The invention introduces multiple mixed attention to learn the characteristic weight according to loss through the network, so that the effective characteristic graph has high weight, and the ineffective or small-effect characteristic graph has small weight, so that the training model achieves better result.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a schematic diagram of a SiamMask network framework of the present invention;
FIG. 3 is a schematic diagram of a mixed attention architecture of the present invention;
FIG. 4 is a schematic diagram of a Resnet+ feature fusion architecture of the present invention;
FIG. 5 is a schematic diagram of a training architecture of the present invention;
FIG. 6 is a graph of the evaluation of VOT2018 test data set according to the present invention;
FIG. 7 is a schematic diagram of the results of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples
As shown in fig. 1-7, a segmentation and tracking method based on attention and feature fusion according to an embodiment of the present invention includes the following steps:
as shown in fig. 2, step one, constructing a basic segmentation and tracking framework based on a siemmask; the core idea of Siammask target tracking is that an object image to be tracked is firstly framed through an initial frame and used as a retrieval basis of a subsequent frame; secondly, inputting the target and the search into a twin network simultaneously, outputting two feature images, and performing cross correlation on the two feature images to obtain a RoW feature image; next, a simple two-layer 1*1 convolution kernel convolution operation is input from the output feature map to obtain an output, and finally, the input of the mask of the target is generated, the input of the SiamMask refers to the twin network structure, and there are two inputs, namely a target image and a search image, wherein the target image is an image to be tracked, for example, 127×127×3, 127×127 represents the size of a template image, and 3 represents the number of channels. The search image refers to an image that is performed to track the target, for example 255×255×3, 255×255 representing the size of the search image, and 3 representing the number of channels;
the backbone network of the extracted features is a CNN structure shared by two branches, one branch target template Z is taken as input, the other branch search area X is taken as input, and the two branch target templates Z and the other branch search area X are simultaneously input into the twin network, f θ ResNet, f for shared weights θ The first four stages of RseNet are used, a cavity convolution kernel with the expansion rate of 2 is used in the first layer convolution of the fourth stage, namely, the 3*3 convolution kernel is changed into 7*7, and the stage step distance is set to be 1, so that the fourth stage does not carry out resolution reduction on the feature map, finally, two feature maps are obtained through a backbone network, namely, a target image feature map and a search area feature map, F1 and F2 are generated feature maps, and the feature map F1 is smaller than the feature map F2 because the Z size is smaller than the X;
where d is the operation of the depth level, that is, the cross correlation operation of the depth separable convolution, to obtain a plurality of candidate window response features, that is, performing a correlation calculation on the two feature maps to obtain a candidate frame response feature map with a constant channel number, for example, 17×17×256, and the response in the middle of the candidate window response feature map is called RoW, that is, the response of the candidate window, for example, 1×1×256;
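For illustration, a minimal PyTorch sketch of this depth-wise cross-correlation is given below; the function name `depthwise_xcorr` and the tensor shapes are illustrative assumptions, not the reference implementation of the invention.

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(search_feat, template_feat):
    """Depth-wise cross-correlation: the template feature map acts as a
    per-channel kernel slid over the search feature map, so the channel
    count (e.g. 256) is preserved in the response."""
    b, c, h, w = search_feat.shape
    # fold batch into channels and use a grouped convolution so that each
    # channel is correlated independently
    x = search_feat.reshape(1, b * c, h, w)
    kernel = template_feat.reshape(b * c, 1, *template_feat.shape[2:])
    out = F.conv2d(x, kernel, groups=b * c)
    return out.reshape(b, c, out.shape[2], out.shape[3])

# illustrative shapes: 15x15 template features, 31x31 search features
z = torch.randn(1, 256, 15, 15)   # target template features
x = torch.randn(1, 256, 31, 31)   # search region features
row = depthwise_xcorr(x, z)       # -> (1, 256, 17, 17) candidate-window responses
print(row.shape)
```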
Three branches are then split off on the basis of RoW, performing segmentation, regression and classification respectively;
Mask branch: in the segmentation task the h_θ network consists of two 1×1 convolution layers, one with 256 channels and one with 63×63 channels. The 256-channel RoW is extracted from the cross-correlation result and passed through two fully connected layers to obtain a (63×63)-channel RoW, which is finally expanded into a 63×63 response map. The classification of each pixel uses the information in the whole RoW, which eliminates the influence of objects similar to the target, and low-level and high-level features are fused at the same time to generate a more accurate mask;
Box branch: the framework of the whole algorithm inputs the current frame and the previous frame into the network and outputs the position of the bounding box in the current frame. After the respective regions are cropped, the current frame and the previous frame are fed into the network for feature extraction; the features are concatenated and input into a fully connected layer so that the features of the target can be compared with those of the current frame to find where the target has moved. The fully connected layer learns a complex feature-comparison function and outputs the relative motion of the target; its output is then connected to a layer of 4 nodes representing the coordinates of two corners of the bounding box, so as to output the position of the target. RoW passes through a convolution layer to obtain a response map of 4K channels, i.e. the respective x, y, w and h offsets of the K anchor boxes;
Score branch: the features of X and Z are extracted and a depth-wise cross-correlation operation is performed on them to obtain a confidence map score, with the following formula:
score=φ(z)*φ(x)+b
where z is the template image, x is the search image, φ is the transformation function, * denotes the cross-correlation operation, and b is a learnable scalar with an initial value of 0;
During training, the loss function for the responses of all candidate windows is the following binary logistic regression loss:
L_mask(θ, φ) = Σ_n ((1 + y_n) / (2wh)) · Σ_ij log(1 + exp(−c_n^ij · m_n^ij))
where y_n ∈ {±1} is the ground-truth label of the pixel-wise mask of size w×h, c_n^ij ∈ {±1} is the label of pixel (i, j) of that mask, and m_n^ij is the prediction for each pixel of the response of the n-th candidate window;
the weighted loss function formula of the whole system is as follows:
L_3B = λ_1 · L_mask + λ_2 · L_score + λ_3 · L_box
The mask branch follows DeepMask: the 256-channel RoW (shown in blue in the figure) is extracted from the cross-correlation result and passed through two fully connected layers to obtain a (63×63)-channel RoW; each column is expanded into a 63×63 response map that represents the score of the degree of coincidence or similarity between the target template and a certain position of the search area, and a mask map is then obtained by upsampling to expand the resolution. The box branch passes RoW through a convolution layer to obtain a response map of 4K channels, i.e. the respective x, y, w and h offsets of the K anchor boxes, while the score branch gives a response map of 2K channels, i.e. the foreground and background confidences of the K anchor boxes. The two branches use the smooth L1 loss and the cross-entropy loss respectively;
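A hedged sketch of how the weighted three-branch loss L_3B could be assembled is shown below, assuming a binary logistic (BCE-with-logits) mask loss, a cross-entropy score loss and a smooth L1 box loss as stated above; the tensor layout and the λ values are assumptions, not values fixed by the invention.

```python
import torch
import torch.nn.functional as F

def weighted_three_branch_loss(mask_logits, mask_gt, row_labels,
                               score_logits, score_gt, box_pred, box_gt,
                               lambdas=(32.0, 1.0, 1.0)):  # weights are illustrative
    """L_3B = l1*L_mask + l2*L_score + l3*L_box.
    Assumed shapes: mask_logits/mask_gt (N, 63*63), row_labels (N,),
    score_logits (N, 2), score_gt (N,), box_pred/box_gt (N, 4)."""
    pos = row_labels > 0  # mask and box losses only on positive RoWs (y_n = +1)
    l_mask = (F.binary_cross_entropy_with_logits(mask_logits[pos], mask_gt[pos])
              if pos.any() else mask_logits.sum() * 0)
    l_score = F.cross_entropy(score_logits, score_gt)          # fg/bg classification
    l_box = (F.smooth_l1_loss(box_pred[pos], box_gt[pos])      # (x, y, w, h) offsets
             if pos.any() else box_pred.sum() * 0)
    l1, l2, l3 = lambdas
    return l1 * l_mask + l2 * l_score + l3 * l_box
```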
As shown in fig. 3, step two: adding a mixed attention module. The channel attention module and the spatial attention module are connected in parallel; the input passes through the channel attention module and the spatial attention module respectively, and the two outputs are added to obtain a more accurately calibrated feature map;
Channel attention first extracts image features through a convolution layer, then compresses the feature matrix into a feature vector, trains a fully connected layer to obtain the relationship between features and classes, and finally computes the probability of each class for the target. The input feature map is U = [u_1, u_2, ..., u_C], where each channel u_i ∈ R^(H×W). After the global pooling layer, U yields a vector z ∈ R^(1×1×C); for each position k the formula is:
z_k = (1 / (H × W)) · Σ_{i=1..H} Σ_{j=1..W} u_k(i, j)
The resulting vector is passed through two fully connected layers to obtain the importance of each channel, with a ReLU in between, and is finally normalized to the range 0 to 1 by a sigmoid layer to give σ(z'). The whole process is:
U_1 = [σ(z'_1)·u_1, σ(z'_2)·u_2, σ(z'_3)·u_3, ..., σ(z'_n)·u_n]
where z'_i is the importance of the i-th channel u_i.
the spatial attention is first to convolve the feature map directly to make the feature map from [ C, H, W ]]Becomes [1, H, W ]]Is then mapped to [0, 1] using a sigmoid function]The space attention graph is directly added into the original feature graph to complete the information calibration of the space, and the input feature graph is U= [ U ] 1,1 ,u 1,2 ,...,u i ,j ,...,u H,W ]H, W are the dimensions of the feature map, respectively, (i, j) is the spatial position of the feature map, the spatial extrusion operation is realized by convolution, where the convolution is a convolution of 1×1, and the number of output channels is 1, and the formula is as follows:
q=W sq ★U
wherein q is a characteristic diagram with the channel number of 1, and the characteristic diagram is normalized to [0-1] by sigmod, and the formula is as follows:
wherein the method comprises the steps ofRepresenting the importance of the spatial coordinates (i, j) in the feature map,
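The parallel channel/spatial recalibration described in this step can be sketched roughly as follows in PyTorch; the class name, the reduction ratio and the exact layer ordering are assumptions consistent with the text, not the exact module of the invention.

```python
import torch
import torch.nn as nn

class MixedAttention(nn.Module):
    """Parallel channel and spatial attention; the two recalibrated maps
    are summed, as described in step two."""
    def __init__(self, channels, reduction=16):  # reduction ratio is an assumption
        super().__init__()
        # channel attention: [C,H,W] -> global average pool -> two 1x1 convs -> sigmoid
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # spatial attention: [C,H,W] -> 1x1 conv -> [1,H,W] -> sigmoid
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, u):
        u_channel = u * self.channel(u)   # channel-wise recalibration
        u_spatial = u * self.spatial(u)   # spatial recalibration
        return u_channel + u_spatial      # parallel branches are added

x = torch.randn(1, 256, 17, 17)
print(MixedAttention(256)(x).shape)   # torch.Size([1, 256, 17, 17])
```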
As shown in fig. 4, step three: adding feature fusion. Feature fusion does not change the original backbone network; the feature maps at each layer of the original backbone are taken out and then operated on. The model is designed in three parts: a bottom-up structure, a top-down structure, and lateral connections. Bottom-up: as the network in the left half deepens, the size of the feature maps keeps shrinking while the semantic information becomes richer, and the last feature map of each stage forms the feature pyramid. Top-down: the feature maps are continuously enlarged by upsampling, so that the low-level features also contain rich semantic information. Lateral connection: the upsampling result is fused with the feature map of the same size generated bottom-up; that is, the feature map from the left undergoes a 1×1 convolution, is added to the feature map from above, and then passes through a 3×3 convolution to obtain the feature output of that layer;
When the input is [320×320×64], the output of the Layer1 layer is [160×160×256] (the output and input sizes of Layer1 in ResNet are the same, since Layer1 in every ResNet does not downsample W and H and only adjusts the channels), the output of the Layer2 layer is [80×80×512], and the output of the Layer3 layer is [40×40×1024]. The Conv1 layer and the MaxPool layer before Layer1 each halve H and W, so the height and width at Layer1 are reduced 4 times; Layer1 does not change the height and width, and each of Layer2 to Layer3 halves the height and width. P2 to P4 are used subsequently to predict the bounding box, box regression and mask of objects;
Step four: sample pictures are input into the network for training, and the training process is performed offline. The sample pictures are 127×127 and the search pictures 255×255; the network backbone is pre-trained on ImageNet-1k. Training runs for 20 epochs, with the learning rate increasing from 0.001 to 0.005 over the first five epochs and decreasing to 0.0003 over the last 15 epochs;
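The stated learning-rate endpoints can be turned into a small schedule function as sketched below; only the endpoints (0.001 → 0.005 over the first five epochs, then down to 0.0003 over the remaining 15) come from the text, while the interpolation shapes are assumptions.

```python
import math

def learning_rate(epoch, warmup_epochs=5, total_epochs=20,
                  lr_start=1e-3, lr_peak=5e-3, lr_end=3e-4):
    """Warm up linearly to the peak, then decay log-linearly to the end value.
    The curve shapes are assumptions; only the endpoints come from the text."""
    if epoch < warmup_epochs:
        t = epoch / max(warmup_epochs - 1, 1)
        return lr_start + t * (lr_peak - lr_start)
    t = (epoch - warmup_epochs) / max(total_epochs - warmup_epochs - 1, 1)
    return math.exp(math.log(lr_peak) + t * (math.log(lr_end) - math.log(lr_peak)))

for e in range(20):
    print(e, round(learning_rate(e), 5))   # 0.001 ... 0.005 ... 0.0003
```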
Step five: model testing. The data set provided by the official VOT website is adopted and the training effect of the method is tested according to the VOT evaluation indices; as can be seen from fig. 6, the single-target tracking algorithm provided by the invention performs better than the original baseline algorithm.
As shown in fig. 2, in some embodiments, the basic network framework is a SiamMask-based basic tracking framework comprising a twin sub-network, the feature extraction network ResNet-50, a depth-wise separable convolution cross-correlation layer, and three output branches; the twin sub-network layer is used to measure the similarity of two inputs, the ResNet-50 layer and the depth-wise separable convolution cross-correlation layer are used to generate multiple candidate-window response features, and the three output branches are the mask branch, the box branch and the score branch.
Mask branch: used to predict the target mask; the classification of each pixel uses the information in the whole RoW, eliminating the influence of objects similar to the target, and low-level and high-level features are fused at the same time to generate a more accurate mask;
Box branch: gives the offset between the anchor box position and the actual position, used to correct the position from the previous frame. The framework of the whole algorithm inputs the current frame and the previous frame into the network and outputs the position of the bounding box in the current frame; after the respective regions are cropped, the two frames are fed into the network for feature extraction, the features are concatenated and input into a fully connected layer so that the features of the target can be compared with those of the current frame to find where the target has moved, the fully connected layer learns a complex feature-comparison function and outputs the relative motion of the target, its output is then connected to a layer of 4 nodes representing the coordinates of two corners of the bounding box so as to output the position of the target, and RoW passes through a convolution layer to obtain a response map of 4K channels, i.e. the respective x, y, w and h offsets of the K anchor boxes;
Score branch: used to distinguish whether an anchor box is foreground or background; the features of the template image and the search image are extracted and a depth-wise cross-correlation operation is performed on them to obtain the confidence map score.
As shown in fig. 3, in some embodiments, the SiamMask network introduces a new mixed attention mechanism. Mixed attention includes channel attention and spatial attention; the two attention modules are connected in parallel, and the input, after passing through the channel attention and spatial attention modules respectively, is added to obtain a more accurately calibrated feature map. Channel attention: [C, H, W] is reduced to [C, 1, 1] by global average pooling, two 1×1 convolutions then process the information to obtain a C-dimensional vector, a sigmoid function normalizes it into the corresponding mask, and channel-wise multiplication yields the information-calibrated feature map;
Spatial attention: the feature map is convolved directly with a 1×1 convolution so that [C, H, W] becomes [1, H, W]; a sigmoid activation then gives a spatial attention map normalized to [0, 1] with dimension 1×H×W, and feature recalibration multiplies it spatially with the original U to complete the spatial information calibration.
As shown in fig. 4, in some embodiments, the SiamMask network introduces feature fusion to strengthen the CNN feature expression of the backbone network, and the model is designed in three parts: a bottom-up structure, a top-down structure, and lateral connections. For the feature extraction network ResNet-50 only the first four stages are used, and a dilated convolution kernel with dilation rate 2 is used in the first convolution layer of the fourth stage. The feature fusion strategy is improved: the outputs of the ResNet convolution blocks conv2, conv3 and conv4 are defined as {C2, C3, C4}, layers whose output feature maps keep the same size are grouped into one stage, and the feature output by the last layer of each stage is extracted, the outputs being 1/4, 1/8 and 1/16 of the original image after downsampling, respectively; they are then fused, through lateral connections, with the feature maps generated by top-down upsampling, and finally processed by 3×3 convolutions to obtain the outputs P2, P3 and P4.
Specifically, C4 is processed with 256 1×1 convolution kernels to obtain a 64×256 result, denoted P4; P4 is upsampled by a factor of 2 and added to the result of applying 256 1×1 convolution kernels to C3 to obtain P3; P3 is upsampled by a factor of 2 and added to the result of applying 256 1×1 convolution kernels to C2 to obtain P2.
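A minimal sketch of this top-down fusion is given below, assuming 256-channel lateral 1×1 convolutions, factor-2 upsampling with addition, and 3×3 smoothing; the channel counts of C2-C4 follow ResNet-50 and the input shapes are illustrative, not prescribed by the invention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    """Top-down fusion of ResNet stages C2-C4 into P2-P4 (256 channels each)."""
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        # lateral 1x1 convolutions reduce each Ci to a common channel width
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        # 3x3 convolutions smooth the merged maps
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, c2, c3, c4):
        p4 = self.lateral[2](c4)
        p3 = self.lateral[1](c3) + F.interpolate(p4, scale_factor=2, mode='nearest')
        p2 = self.lateral[0](c2) + F.interpolate(p3, scale_factor=2, mode='nearest')
        return self.smooth[0](p2), self.smooth[1](p3), self.smooth[2](p4)

# illustrative 1/4, 1/8, 1/16 resolution stage outputs for a 256x256 input
c2, c3, c4 = (torch.randn(1, 256, 64, 64),
              torch.randn(1, 512, 32, 32),
              torch.randn(1, 1024, 16, 16))
p2, p3, p4 = FeatureFusion()(c2, c3, c4)
print(p2.shape, p3.shape, p4.shape)
```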
As shown in fig. 5, in some embodiments, the model training includes the steps of:
First, a sample is obtained and input into the network model for training. The sample comprises a target image and a search image, where the target image is the 127×127 image to be tracked and the search image is the 255×255 image in which the target is tracked. The backbone network is a CNN structure shared by the two branches: one branch takes the target template Z as input and the other takes the search area X as input, and both are fed into the same feature extraction network, ResNet. Only the first four stages of ResNet are used, a dilated convolution kernel with dilation rate 2 is used in the first convolution layer of the fourth stage, the 3×3 convolution kernel is changed into a 7×7 convolution kernel, and the stride is set to 1. The feature extraction network finally produces two feature maps, namely the target image feature map and the search area feature map;
To generate the target feature map, the target is first selected and preprocessed, and then cropped with the crop box centered on it: the center of the preprocessed picture is taken directly as the center, the coordinates of a crop box of size 127 are obtained, the resulting crop box is subjected to operations such as random scaling, random translation by a few pixels and random flipping, and finally an affine transformation yields a picture with the target at its center;
The search area feature map is generated similarly to the target feature map, except that, besides cropping the search region from the original image, a mask is cropped from the mask map, and data enhancement operations such as random blurring and flipping are then applied to the picture and the mask in sync; a depth-wise separable convolution operation is then performed, the resulting response keeps the number of channels unchanged at 256 and is called RoW, i.e. the response of the candidate window, and three branches are then split off on the basis of RoW to perform segmentation, regression and classification respectively.
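A rough sketch of this crop-and-augment pipeline is given below, assuming OpenCV affine warps; the helper `make_training_pair` and the augmentation ranges are hypothetical and only illustrate cropping the 127×127 template and the 255×255 search patch with the mask augmented in sync.

```python
import cv2
import numpy as np

def make_training_pair(image, mask, center, rng=np.random):
    """Crop a 127x127 template and a 255x255 search patch around `center`,
    applying random scale / shift / flip jointly to the search patch and mask.
    Augmentation ranges are illustrative assumptions."""
    def crop(img, size, scale=1.0, shift=(0, 0), flags=cv2.INTER_LINEAR):
        cx, cy = center[0] + shift[0], center[1] + shift[1]
        # affine transform mapping the crop window onto a size x size patch
        a = scale
        m = np.array([[a, 0, size / 2 - a * cx],
                      [0, a, size / 2 - a * cy]], dtype=np.float32)
        return cv2.warpAffine(img, m, (size, size), flags=flags)

    template = crop(image, 127)

    scale = rng.uniform(0.95, 1.05)           # random scaling
    shift = rng.randint(-8, 9, size=2)        # random translation by a few pixels
    search = crop(image, 255, scale, shift)
    search_mask = crop(mask, 255, scale, shift, flags=cv2.INTER_NEAREST)

    if rng.rand() < 0.5:                      # random horizontal flip, kept in sync
        search, search_mask = search[:, ::-1].copy(), search_mask[:, ::-1].copy()
    return template, search, search_mask
```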
Of the three branches, segmentation is used to predict the target mask; classification is used to distinguish whether an anchor box is foreground or background; and regression gives the offset between the anchor box position and the actual position, used to correct the position from the previous frame.
The network backbone is pre-trained on ImageNet-1k. SGD is used with a warm-up phase: the learning rate increases from 0.001 to 0.005 over the first five epochs and then decreases to 0.0003 over the next 15 epochs.
As shown in fig. 6, in some embodiments, the model test includes the steps of: testing the tracking effect of the trained model in a new video sequence;
In single-object tracking, as shown in fig. 7, a rectangular box containing the center position and size of the object to be tracked is typically given in the first frame; this box is usually marked manually. The tracking algorithm is then required to follow this box in the subsequent frames, computing the position offset and size change of the object in each subsequent frame, and the test results on the VOT data set are shown on the video sequence to give a direct visual impression.
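The visualization loop can be sketched as follows; `tracker.init` and `tracker.track` are hypothetical interfaces standing in for the trained model, and the drawing details are assumptions.

```python
import cv2

def visualize_tracking(video_path, init_box, tracker):
    """Draw the predicted box and mask on every frame after a manual first-frame init.
    `tracker` is assumed to expose init(frame, box) and track(frame) -> (box, mask)."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    if not ok:
        return
    tracker.init(frame, init_box)             # init_box = (x, y, w, h), marked by hand
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        box, mask = tracker.track(frame)      # per-frame offset and size update
        x, y, w, h = map(int, box)
        frame[mask > 0.5] = (0, 255, 0)       # overlay the segmentation mask
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 0, 255), 2)
        cv2.imshow('tracking', frame)
        if cv2.waitKey(1) & 0xFF == 27:       # Esc to quit
            break
    cap.release()
    cv2.destroyAllWindows()
```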
Finally, it should be noted that the foregoing description is only a preferred embodiment of the present invention and the present invention is not limited thereto; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A segmentation and tracking method based on attention and feature fusion is characterized in that: the method comprises the following steps:
step 1, constructing a SiamMask-based basic segmentation and tracking framework;
step 2, adding a mixed attention module;
step 3, adding feature fusion;
step 4, model training, namely inputting sample pictures into a twin network for training, wherein the training process is performed offline;
and 5, model testing.
2. The method for segmenting and tracking based on attention and feature fusion according to claim 1, wherein: step 1 comprises a twin sub-network, the feature extraction network ResNet-50, a depth-wise separable convolution cross-correlation layer and three output branches; the twin sub-network layer is used to measure the similarity of two inputs, the ResNet-50 layer and the depth-wise separable convolution cross-correlation layer are used to generate multiple candidate-window response features, and the three output branches are the mask branch, the bounding box branch and the score branch.
3. The method for segmenting and tracking based on attention and feature fusion according to claim 1, wherein: the mixed attention module added in step 2 includes channel attention and spatial attention; the two attention modules are connected in parallel, and the input, after passing through the channel attention module and the spatial attention module respectively, is added to obtain a more accurately calibrated feature map; channel attention: [C, H, W] is reduced to [C, 1, 1] by global average pooling, two 1×1 convolutions then process the information to obtain a C-dimensional vector, a sigmoid function normalizes it into the corresponding mask, and channel-wise multiplication yields the information-calibrated feature map; spatial attention: the feature map is convolved directly with a 1×1 convolution so that [C, H, W] becomes a [1, H, W] feature, a sigmoid activation then gives the spatial attention map, and finally this map is applied directly to the original feature map to complete the spatial information calibration.
4. The method for segmenting and tracking based on attention and feature fusion according to claim 1, wherein: in step 3, to strengthen the CNN feature expression of the backbone network, the model is designed in three parts: a bottom-up structure, a top-down structure, and lateral connections; for the feature extraction network ResNet-50 only the first four stages are used, and a dilated convolution kernel with dilation rate 2 is used in the first convolution layer of the fourth stage; the feature fusion strategy is improved: the outputs of the ResNet convolution blocks conv2, conv3 and conv4 are defined as {C2, C3, C4}, layers whose output feature maps keep the same size are grouped into one stage, and the feature output by the last layer of each stage is extracted, the outputs being 1/4, 1/8 and 1/16 of the original image after downsampling, respectively; they are then fused, through lateral connections, with the feature maps generated by top-down upsampling, and finally processed by 3×3 convolutions to obtain the outputs P2, P3 and P4.
5. The method for segmenting and tracking based on attention and feature fusion according to claim 1, wherein: the step 4 model training comprises the following steps:
S1, sample pictures are input into the twin network for training, and the training process is performed offline; training uses four data sets: COCO, ImageNet-DET2015, ImageNet-VID2015, and YouTube-VOS.
S2, the twin network is used to measure the similarity of the input samples: a sample comprises a target image and a search image, where the target image is the 127×127 image to be tracked and the search image is the 255×255 image in which tracking of the target is performed; the twin neural network has two input branches, one taking the target template Z as input and the other taking the search region X as input; the two inputs enter two weight-shared neural networks, which map them into a new space to form representations of the inputs in that space, and the similarity of the two inputs is evaluated by computing the loss.
S3, the feature extraction network: the target template and the search area are both fed into the same feature extraction network, ResNet; only the first four stages of ResNet are used, a dilated convolution kernel with dilation rate 2 is used in the first convolution layer of the fourth stage, the 3×3 convolution kernel is changed into a 7×7 convolution kernel, and the stride is set to 1; the feature extraction network finally produces two feature maps, namely the target image feature map and the search area feature map;
S4, feature map generation: to generate the target feature map, the target is first selected and preprocessed, and then cropped with the crop box centered on it; the center of the preprocessed picture is taken directly as the center, the coordinates of a crop box of size 127 are obtained, the resulting crop box is then subjected to operations such as random scaling, random translation by a few pixels and random flipping, and finally an affine transformation yields a picture with the target at its center;
the search area feature map is generated similarly to the target feature map, except that, besides cropping the search region from the original image, a mask is cropped from the mask map, and data enhancement operations such as random blurring and flipping are then applied to the picture and the mask in sync; a depth-wise separable convolution operation is then performed, the resulting response keeps the number of channels unchanged at 256, the response in the middle is called RoW, i.e. the response of the candidate window, and three branches are then split off on the basis of RoW to perform segmentation, regression and classification respectively;
S5, pre-training: the network backbone is pre-trained on ImageNet-1k; SGD is used with a warm-up phase, the learning rate increasing from 0.001 to 0.005 over the first five epochs and then decreasing to 0.0003 over the next 15 epochs.
6. The method for segmenting and tracking based on attention and feature fusion according to claim 1, wherein: the step 5 model test comprises the following steps:
S1, the trained model is tested on the VOT data set to obtain the tracking effect.
S2, the tracking effect is visualized on a new video sequence: a rectangular box containing the center position and size of the target to be tracked is marked manually in the first frame, the tracking algorithm is then required to follow this box in the subsequent frames, and the algorithm computes the position offset and size change of the target in the subsequent frames.
CN202310519848.4A 2023-05-10 2023-05-10 Segmentation and tracking method based on attention and feature fusion Pending CN116596966A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310519848.4A CN116596966A (en) 2023-05-10 2023-05-10 Segmentation and tracking method based on attention and feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310519848.4A CN116596966A (en) 2023-05-10 2023-05-10 Segmentation and tracking method based on attention and feature fusion

Publications (1)

Publication Number Publication Date
CN116596966A true CN116596966A (en) 2023-08-15

Family

ID=87600115

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310519848.4A Pending CN116596966A (en) 2023-05-10 2023-05-10 Segmentation and tracking method based on attention and feature fusion

Country Status (1)

Country Link
CN (1) CN116596966A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117392392A (en) * 2023-12-13 2024-01-12 河南科技学院 Rubber cutting line identification and generation method
CN117392392B (en) * 2023-12-13 2024-02-13 河南科技学院 Rubber cutting line identification and generation method

Similar Documents

Publication Publication Date Title
CN111210443B (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
Zhou et al. Split depth-wise separable graph-convolution network for road extraction in complex environments from high-resolution remote-sensing images
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN106547880B (en) Multi-dimensional geographic scene identification method fusing geographic area knowledge
CN111612807B (en) Small target image segmentation method based on scale and edge information
CN110910391B (en) Video object segmentation method for dual-module neural network structure
CN111612008B (en) Image segmentation method based on convolution network
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
CN113609896B (en) Object-level remote sensing change detection method and system based on dual-related attention
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN113344932A (en) Semi-supervised single-target video segmentation method
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
Xu et al. AMCA: Attention-guided multi-scale context aggregation network for remote sensing image change detection
CN116740527A (en) Remote sensing image change detection method combining U-shaped network and self-attention mechanism
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN116597267B (en) Image recognition method, device, computer equipment and storage medium
Chacon-Murguia et al. Moving object detection in video sequences based on a two-frame temporal information CNN
CN116311353A (en) Intensive pedestrian multi-target tracking method based on feature fusion, computer equipment and storage medium
CN115713546A (en) Lightweight target tracking algorithm for mobile terminal equipment
Zheng et al. DCU-NET: Self-supervised monocular depth estimation based on densely connected U-shaped convolutional neural networks
CN114708591A (en) Document image Chinese character detection method based on single character connection
CN113223006A (en) Lightweight target semantic segmentation method based on deep learning
Yian et al. Improved deeplabv3+ network segmentation method for urban road scenes
Chen et al. Building extraction from high-resolution remote sensing imagery based on multi-scale feature fusion and enhancement
CN116486203B (en) Single-target tracking method based on twin network and online template updating

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination