WO2023273136A1 - A visual tracking method based on target representation point estimation - Google Patents

A visual tracking method based on target representation point estimation

Info

Publication number
WO2023273136A1
WO2023273136A1 (PCT/CN2021/133957)
Authority
WO
WIPO (PCT)
Prior art keywords
target
frame
convolution
point
target frame
Prior art date
Application number
PCT/CN2021/133957
Other languages
English (en)
French (fr)
Inventor
钱诚
徐则中
游庆祥
刘冬
李春光
王甜
Original Assignee
常州工学院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 常州工学院
Publication of WO2023273136A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20112Image segmentation details
    • G06T2207/20132Image cropping

Definitions

  • The invention relates to the field of visual tracking, and in particular to a visual tracking method based on the estimation of target representation points.
  • Visual tracking starts from the target object specified in the first video frame and determines the spatial position of that target in every subsequent frame.
  • Visual tracking can be regarded as a target-template matching problem: the target image region annotated in the first frame serves as a template that is matched against subsequent video frames to locate the target image region.
  • Accordingly, tracking frameworks based on the Siamese network structure have been proposed for this image-matching task.
  • The purpose of the present invention is to provide a visual tracking method based on target representation point estimation, so as to solve the problems mentioned in the background art.
  • A visual tracking method based on target representation point estimation comprises the following steps: S1, specify the target box in the first frame as the target template; S2, crop the target search image region in the next frame; S3, input the target template and the search image region into the Siamese network; S4, the target box estimation module outputs predicted target boxes and the foreground-background classification module outputs a confidence map; S5, take the target box with the highest confidence as the final target box, and repeat steps S2-S5.
  • The structure of the Siamese network includes a feature extraction module, a cross-correlation module, a target box estimation module, and a foreground-background classification module. Each convolutional neural network branch of the Siamese network is a backbone module for extracting deep features; the cross-correlation module computes the matching likelihood between the target template features and the search-region features; and the target box estimation module outputs the target box on the basis of the representation-point estimates. A minimal sketch of the overall data flow follows.
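As an illustration of the data flow just described, the following PyTorch sketch wires the four modules together. The component names (backbone, xcorr, box_head, cls_head) are hypothetical placeholders, elaborated in the sketches after the corresponding paragraphs below; this is a reading aid, not the patented implementation.

```python
import torch

class SiameseTracker(torch.nn.Module):
    # Hypothetical wiring of the four modules described above.
    def __init__(self, backbone, xcorr, box_head, cls_head):
        super().__init__()
        self.backbone = backbone   # feature extraction module (shared weights)
        self.xcorr = xcorr         # cross-correlation module
        self.box_head = box_head   # target box estimation module
        self.cls_head = cls_head   # foreground-background classification module

    def forward(self, template_img, search_img):
        t_feats = self.backbone(template_img)       # template-branch features
        s_feats = self.backbone(search_img)         # search-branch features
        corr = self.xcorr(t_feats, s_feats)         # fused matching-likelihood map(s)
        init_off, final_off = self.box_head(corr)   # representation-point offsets
        scores = self.cls_head(corr, init_off)      # per-location confidence
        return final_off, scores
```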
  • The Siamese network has two branches, each a convolutional neural network whose backbone adopts the residual network ResNet-50. ResNet-50 comprises the 1st through 5th convolution blocks. In the 4th and 5th convolution blocks the downsampling operation is discarded and dilated (atrous) convolution is used to enlarge the receptive field, with the dilation rate set to 2 in the 4th block and 4 in the 5th block. The 4th and 5th convolution blocks are used to extract deep features of the target template image and the target search image, respectively (a backbone sketch follows).
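A minimal backbone sketch, assuming a torchvision-style ResNet-50: `replace_stride_with_dilation=[False, True, True]` removes the stride of the 4th and 5th convolution blocks (torchvision's `layer3`/`layer4`) and replaces it with dilation, which corresponds to the dilation rates 2 and 4 described above. Weight initialization and pretraining are omitted.

```python
import torch
import torchvision

class Backbone(torch.nn.Module):
    """ResNet-50 trunk that returns the outputs of conv blocks 3, 4 and 5."""
    def __init__(self):
        super().__init__()
        # Drop downsampling in blocks 4 and 5 and use dilated convolution instead.
        r = torchvision.models.resnet50(
            weights=None, replace_stride_with_dilation=[False, True, True])
        self.stem = torch.nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.block2, self.block3 = r.layer1, r.layer2
        self.block4, self.block5 = r.layer3, r.layer4

    def forward(self, x):
        x = self.block2(self.stem(x))
        f3 = self.block3(x)          # block-3 features
        f4 = self.block4(f3)         # block-4 features (dilated, no downsampling)
        f5 = self.block5(f4)         # block-5 features (dilated, no downsampling)
        return f3, f4, f5

# The template branch runs on the 127x127 template image; the search branch
# reuses the same backbone on the 255x255 search image.
f3, f4, f5 = Backbone()(torch.randn(1, 3, 127, 127))
```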
  • The features output by the 3rd, 4th, and 5th convolution blocks are fused to overcome the differences among features extracted at different depths of the network. For the output of each of these blocks, the feature map of the target template is treated as a convolution kernel and convolved with the feature map of the search image, and the resulting cross-correlation feature map is used as the input to the subsequent foreground-background classification and target box estimation. When computing the cross-correlation maps, the convolution stride parameter is set to each of {(1,1), (1,2), (2,1)}, yielding three groups of cross-correlation feature maps; within each group, the three maps computed from the 3rd, 4th, and 5th convolution blocks are weighted and summed on the corresponding channels to obtain the final cross-correlation feature map (a sketch follows).
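A sketch of the cross-correlation computation, assuming the block-3/4/5 features have already been projected to a common number of channels (for example by 1×1 convolutions, a detail the text does not spell out), and using fixed scalar fusion weights in place of whatever weighting the patent trains:

```python
import torch
import torch.nn.functional as F

def xcorr(template_feat, search_feat, stride=(1, 1)):
    """Channel-wise cross-correlation: the template feature map acts as the
    convolution kernel that is slid over the search-image feature map."""
    n, c, h, w = template_feat.shape
    kernel = template_feat.reshape(n * c, 1, h, w)
    x = search_feat.reshape(1, n * c, *search_feat.shape[2:])
    out = F.conv2d(x, kernel, stride=stride, groups=n * c)
    return out.reshape(n, c, *out.shape[2:])

def fused_xcorr(template_feats, search_feats, weights=(1.0, 1.0, 1.0)):
    """For each stride setting, weight and sum the correlation maps obtained
    from the block-3, block-4 and block-5 features."""
    fused = []
    for stride in [(1, 1), (1, 2), (2, 1)]:       # the three stride settings
        per_block = [w * xcorr(t, s, stride)
                     for w, t, s in zip(weights, template_feats, search_feats)]
        fused.append(sum(per_block))              # channel-wise weighted sum
    return fused                                  # three cross-correlation maps
```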
  • The target box estimation module receives the cross-correlation feature map and outputs, for each feature point, the offsets of the corresponding representation points and the positions of the target-region representation points. Its network structure has two branches: a trunk layer consisting of four convolution layers with 256 input and output channels and 3×3 kernels, followed by one convolution layer with 256 input channels, 18 output channels, and a 1×1 kernel; and a branch layer containing one deformable convolution layer with 256 output channels and 3×3 kernels, followed by one convolution layer with 256 input channels, 18 output channels, and a 1×1 kernel.
  • The trunk layer receives the cross-correlation feature map output by the cross-correlation module and outputs the offset of each representation point as the representation-point displacement parameter; the initial target box to which each feature point belongs is estimated from these offsets. The branch layer receives the feature map output by the third layer of the trunk and outputs a further offset for each representation point. The initial displacements output by the trunk are used to estimate the initial positions of the representation points; adding the branch layer's offsets to these initial positions gives the final representation-point positions, from which the target box is obtained directly (a sketch of this two-branch head follows).
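A sketch of the two-branch estimation head using torchvision's `DeformConv2d`. The 18 offset channels (9 points × 2 coordinates) predicted by the trunk double as the sampling offsets of the 3×3 deformable kernel. Activation functions, the exact trunk layer from which the branch is tapped, and the min/max construction of the box corners are assumptions not spelled out in the text.

```python
import torch
from torchvision.ops import DeformConv2d

class BoxEstimationHead(torch.nn.Module):
    """Trunk layer: 4 x (3x3 conv, 256 ch) + 1x1 conv to 18 offset channels.
    Branch layer: deformable 3x3 conv driven by those offsets + 1x1 conv to a
    further 18 channels that refine the representation points."""
    def __init__(self, channels=256, num_points=9):
        super().__init__()
        trunk = []
        for _ in range(4):
            trunk += [torch.nn.Conv2d(channels, channels, 3, padding=1),
                      torch.nn.ReLU(inplace=True)]
        self.trunk = torch.nn.Sequential(*trunk)
        self.init_offsets = torch.nn.Conv2d(channels, 2 * num_points, 1)
        self.deform = DeformConv2d(channels, channels, 3, padding=1)
        self.refine = torch.nn.Conv2d(channels, 2 * num_points, 1)

    def forward(self, corr_feat):
        x = self.trunk(corr_feat)
        init = self.init_offsets(x)      # initial representation-point offsets
        refined = self.refine(torch.relu(self.deform(x, init)))
        return init, init + refined      # final points = initial + refinement

def points_to_box(point_offsets, num_points=9):
    """Enclosing box of the representation points at every feature location
    (corner construction assumed to be the per-axis min/max of the points)."""
    n, _, h, w = point_offsets.shape
    pts = point_offsets.reshape(n, num_points, 2, h, w)
    return torch.cat([pts.min(dim=1).values,     # top-left  (x1, y1)
                      pts.max(dim=1).values],    # bottom-right (x2, y2)
                     dim=1)
```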
  • The foreground-background classification module consists of three convolution layers and one deformable convolution layer connected in sequence. It receives the cross-correlation feature map as input and outputs the classification confidence of the candidate box corresponding to each feature point. The three convolution layers each have 256 input and output channels and 3×3 kernels; the deformable convolution has 256 input channels and a 3×3 kernel, and it receives the representation-point displacement parameters output by the target box estimation module as the displacement parameters of its convolution kernel (a sketch follows).
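A sketch of the classification head. The final 1×1 scoring convolution is an assumption (the text only says the module outputs a confidence for each feature point); the deformable convolution reuses the 18-channel representation-point offsets from the box estimation head as its kernel displacement parameters.

```python
import torch
from torchvision.ops import DeformConv2d

class ForegroundHead(torch.nn.Module):
    """3 x (3x3 conv, 256 ch) followed by a deformable 3x3 conv whose sampling
    offsets come from the target box estimation module."""
    def __init__(self, channels=256):
        super().__init__()
        layers = []
        for _ in range(3):
            layers += [torch.nn.Conv2d(channels, channels, 3, padding=1),
                       torch.nn.ReLU(inplace=True)]
        self.convs = torch.nn.Sequential(*layers)
        self.deform = DeformConv2d(channels, channels, 3, padding=1)
        self.score = torch.nn.Conv2d(channels, 1, 1)   # assumed scoring layer

    def forward(self, corr_feat, point_offsets):
        x = self.convs(corr_feat)
        x = torch.relu(self.deform(x, point_offsets))  # reuse the 18-channel offsets
        return self.score(x)                           # per-feature-point confidence
```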
  • The training procedure of the Siamese network includes the following. Training-data preparation: the manually annotated object detection datasets VID and YouTube-BoundingBoxes are used. From each video, two frames no more than 20 frames apart are selected at random. In one frame, the rectangle centered on the target is taken as the target image region; given its width w and height h, it is scaled to 127×127 and used as the raw image input of the target template. In the other frame, a target search region of width 2w and height 2h is cropped around the target center and then scaled to 255×255. Each pair of target template image and target search-region image constitutes one training sample (a cropping sketch follows).
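A cropping sketch for building one training pair, assuming boxes given as (cx, cy, w, h) in pixels; border padding (needed when the crop exceeds the image) is omitted for brevity.

```python
import cv2

def crop_and_resize(img, cx, cy, w, h, out_size):
    # Crop a w x h window centered at (cx, cy) and resize it; border handling
    # (e.g. padding with the mean color) is omitted in this sketch.
    x1, y1 = int(round(cx - w / 2)), int(round(cy - h / 2))
    patch = img[max(y1, 0):y1 + int(h), max(x1, 0):x1 + int(w)]
    return cv2.resize(patch, (out_size, out_size))

def make_training_pair(frame_a, box_a, frame_b, box_b):
    """One training sample: a 127x127 template from frame_a and a 255x255
    search region (2w x 2h around the target center) from frame_b."""
    cx, cy, w, h = box_a
    template = crop_and_resize(frame_a, cx, cy, w, h, 127)
    cx, cy, w, h = box_b
    search = crop_and_resize(frame_b, cx, cy, 2 * w, 2 * h, 255)
    return template, search
```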
  • In the target box estimation module, loss functions are established separately for the trunk layer (the predicted initial target box) and the branch layer (the predicted final target box). Centered on each feature point (x, y), the trunk layer outputs the coordinate offsets (Δxᵢ, Δyᵢ), i = 1, …, 9, of 9 representation points, giving the points (x + Δxᵢ, y + Δyᵢ); the predicted target box is constructed from these 9 points, with its top-left and bottom-right corners determined by the points. Predicted boxes that contain the center of the ground-truth box are taken as positive examples, and the smooth-L1 difference between the top-left and bottom-right corner positions of the positive boxes and the ground-truth box gives the target position loss L_loc1, formula (1), in which t denotes the ground-truth value and v the predicted value.
  • The branch layer performs a deformable convolution using the offsets output by the trunk layer and likewise outputs representation-point offsets relative to the feature points, from which predicted boxes are constructed in the same way as in the trunk layer. Predicted boxes whose intersection-over-union with the ground-truth box exceeds 0.5 are taken as positive examples, and the smooth-L1 difference between predicted and ground-truth boxes in center position, width, and height gives the target position loss L_loc2, formula (2), in which t denotes the ground-truth value and u the predicted value.
  • The foreground-background classification module estimates the confidence that each feature point belongs to the target box. Its loss is a function of the classification error, formula (3): L_cls = ||p*h - g||² + λ||h||², where p is the cross-correlation feature map received by the classification module, h is the convolution kernel, and g is a label map in the form of a two-dimensional Gaussian whose mean is the center coordinate of the ground-truth box.
  • The overall loss is formula (4): L = L_cls + λ₁L_loc1 + λ₂L_loc2, where λ₁ and λ₂ are positive-valued regularization parameters. The loss of formula (4) is backpropagated on the input training data and the network parameters are adjusted until the loss converges. A loss sketch follows.
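A loss sketch under simplifying assumptions: positive-example selection, the different box parameterizations for L_loc1 (corner coordinates) and L_loc2 (center, width, height), and the value of λ are all left to the caller, and the boxes are assumed to arrive as already-matched prediction/ground-truth tensors.

```python
import torch
import torch.nn.functional as F

def classification_loss(p, h, g, lam=1e-4):
    """Formula (3): L_cls = ||p*h - g||^2 + lambda*||h||^2, with g a 2-D
    Gaussian label map centered on the ground-truth box."""
    response = F.conv2d(p, h, padding=h.shape[-1] // 2)   # p correlated with kernel h
    return ((response - g) ** 2).sum() + lam * (h ** 2).sum()

def total_loss(init_boxes, final_boxes, gt_init, gt_final, cls_terms,
               lam1=1.0, lam2=1.0):
    """Formula (4): L = L_cls + lambda1*L_loc1 + lambda2*L_loc2.
    L_loc1 / L_loc2 are smooth-L1 distances on the trunk-layer (corner) and
    branch-layer (center/width/height) box parameterizations respectively."""
    l_loc1 = F.smooth_l1_loss(init_boxes, gt_init)
    l_loc2 = F.smooth_l1_loss(final_boxes, gt_final)
    l_cls = classification_loss(*cls_terms)
    return l_cls + lam1 * l_loc1 + lam2 * l_loc2
```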
  • The target tracking procedure of the Siamese network includes the following steps:
  • Step 1: At the start of tracking, specify the target tracking box in the first video frame and take the image inside the box as the target image.
  • Step 2: During subsequent tracking, crop in the current frame, centered on the previous frame's target box, an image region whose height and width are twice those of the previous box; this is the target search image region in the current frame.
  • Step 3: Using the trained Siamese network, input the target image from Step 1 and the target search image from Step 2 into the target-template branch and the target-search branch, respectively.
  • Step 4: Construct predicted target boxes from the representation points output by the branch of the target box estimation module.
  • Step 5: The foreground-background classification module outputs a confidence for each feature point; the predicted box corresponding to the feature point with the maximum confidence is selected as the final target box.
  • Step 6: Repeat Steps 2 to 5 until the target tracking task on all video frames is completed (a sketch of this loop follows).
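A sketch of the online tracking loop (Steps 1-6), assuming a `model` callable that maps a template and a search crop to image-coordinate boxes with scores, boxes given as (cx, cy, w, h), and the hypothetical `crop_and_resize` helper from the training-data sketch above.

```python
def track(frames, init_box, model, crop_and_resize):
    """frames: iterable of images; init_box: (cx, cy, w, h) in the first frame;
    model(template, search) -> (boxes, scores) in image coordinates."""
    cx, cy, w, h = init_box
    template = crop_and_resize(frames[0], cx, cy, w, h, 127)        # Step 1
    box, results = init_box, [init_box]
    for frame in frames[1:]:
        cx, cy, w, h = box
        search = crop_and_resize(frame, cx, cy, 2 * w, 2 * h, 255)  # Step 2
        boxes, scores = model(template, search)                     # Steps 3-4
        box = boxes[scores.argmax()]                                # Step 5
        results.append(box)
    return results                                                  # Step 6
```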
  • The present invention mainly has the following beneficial effects:
  • The invention proposes a visual tracking method based on representation-point extraction: 9 representation points describe the target, and a deformable convolution is further applied on the basis of these 9 points to extract more robust target appearance features.
  • The invention estimates the offset parameters of the deformable convolution from the representation points, so the extracted features are more targeted and better suited to the requirements of the visual tracking task.
  • Fig. 1 is an overall network structure diagram of the present invention
  • Fig. 2 is a flow chart of the tracking method of the present invention
  • Fig. 3 is an analysis diagram of skier tracking results in the present invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a visual tracking method based on target representation point estimation. The key points of the technical scheme are the following steps: S1, specify the target box in the first frame as the target template; S2, crop the target search image region in the next frame; S3, input the target template and the search image region into a Siamese network; S4, the target box estimation module outputs predicted target boxes and the foreground-background classification module outputs a confidence map; S5, take the target box with the highest confidence as the final target box, and repeat steps S2-S5. The invention estimates the offset parameters of a deformable convolution from the representation points, so the extracted features are more targeted and better suited to the requirements of visual tracking tasks.

Description

一种基于目标物表征点估计的视觉跟踪方法 技术领域
本发明涉及视觉跟踪领域,特别涉及一种基于目标物表征点估计的视觉跟踪方法。
背景技术
视觉跟踪根据视频第一帧中所要跟踪的目标对象,通过跟踪方法在后续每一帧中确定目标空间位置。视觉跟踪可以被视为是一种目标模板匹配问题,也即根据第一帧中标定的目标图像区域,将其作为模板在后续视频序列中寻找匹配对象,以此确定目标图像区域。相应地,基于孪生网络结构的跟踪框架被提出用于图像的匹配。
2018年发表在国际会议IEEEConferenceonComputerVisionandPatternRecognition上的题为《HighPerformanceVisualTrackingwithSiameseRegionProposalNetwork》将目标跟踪分为前背景分类和目标框回归两个子任务,借鉴目标检测的区域提议网络,引入了锚点框用于分类和回归计算。但是,该方法需要预设锚点框,这一方面需要关于锚点框参数的先验知识,另一方面大量锚点框的设置降低了计算效率。
在孪生网络被用于构建跟踪算法框架的过程中,目标物外观由于目标物自身形变造成其与目标模板之间的差异性过大,降低了匹配的准确性,严重削弱了孪生跟踪框架下匹配的有效性。针对这一问题,2020年发表在国际会议EuropeanConferenceonComputerVision上的题为《Ocean:Object-awareAnchor-freeTracking》的论文提出了一种目标区域感知的孪生网络框架,通过可形变卷积来配准目标区域,以此获得更为准确的目标特征,这在一定程度上缓解了目标物形变对模板匹配的不利影响。但是,该方法根据目标框位置估计结果来获取配准点,并在此基础上做可形变卷积提取目标外观特征,其中配准点是在边框上以固定的几何点方式采集获得,并不一定能够完全反映 目标形变。
发明内容
针对背景技术中提到的问题,本发明的目的是提供一种基于目标物表征点估计的视觉跟踪方法,以解决背景技术中提到的问题。
本发明的上述技术目的是通过以下技术方案得以实现的:
一种基于目标物表征点估计的视觉跟踪方法,包括以下步骤:
S1、首帧中指定目标框作为目标模板;
S2、在下一帧中裁剪出目标搜索图像区域;
S3、将目标模板与搜索图像区域输入孪生网络;
S4、目标框估计模块输出预测目标框和前背景分类模块输出置信度图;
S5、取具有最大置信度目标框作为最终目标框,并重复S2-S5步骤。
较佳的,所述孪生网络的结构包括特征提取模块、互相关模块、目标框估计模块、前背景分类模块;所述孪生网络的每一支卷积神经网络都是用于提取深度特征的主干网络模块,所述互相关模块计算目标模板特征与搜索区域特征之间的匹配似然度,所述目标框估计模块是在表征点估计结果的基础上输出目标框。
较佳的,所述孪生网络具有2支卷积神经网络构成的分支,每支所述卷积神经网络的主干网络都采用了残差神经网络ResNet-50,残差神经网络ResNet-50包含第1卷积块、第2卷积块、第3卷积块、第4卷积块、第5卷积块,在残差神经网络ResNet-50的第4卷积块、第5卷积块中舍去了下采样操作并采用空洞卷积来扩大感受野,其中第4卷积块中的空洞率设置为2,第5卷积块中的空洞率设置为4,第4卷积块和所述第5卷积块分别用于目标模板图像和目标搜索图像深度特征的提取。
较佳的,使用所述第3卷积块、第4卷积块和第5卷积块输出的特征结果进行融合克服多层卷积神经网络所提取的特征存在的差异性,对于每一个 卷积块的输出,将目标模板的特征图视作为卷积核,并与搜索图像的特征图作卷积计算,将获得的互相关特征图作为后续前背景分类、目标框位置估计的输入;在计算互相关图时,将卷积的跨度参数按{(1,1),(1,2),(2,1)}设置,从而得到3组互相关特征图,对于每一组互相关特征图,由第3卷积块、第4卷积块和第5卷积块计算所得的3个互相关特征图在对应通道上做加权求和操作,最后得到互相关特征图。
较佳的,所述目标框估计模块接收互相关特征图,输出每个特征点所对应表征点的偏移量和目标区域表征点位置,所述目标框估计模块的网络结构包括了2个分支,其中一个分支层由4层256输入输出通道、3×3卷积核的卷积层,以及1层256输入通道、18个输出通道、1×1卷积核的卷积层构成的主干层;另一个分支层包含了由1层256输出通道、3×3卷积核构成的可形变卷积层,以及1层由256输入通道、18输出通道、1×1卷积核构成的卷积层。
较佳的,所述主干层接收互相关模块输出的互相关特征图,其输出每个表征点的偏移量为表征点位移参数,由偏移量估计出每个特征点所属的初始目标框;所述分支层接收主干层第3层的特征图输出,其输出表征点进一步的偏移量;所述主干层输出的初始表征点位移量用来估计表征点初始位置,而后由表征点初始位置加上分支层输出的表征点偏移量可以得到表征点最终的位置结果,进一步在表征点的基础上直接得到目标框。
较佳的,所述前背景分类模块由3层卷积层和1层可形变卷积层依次连接构成,所述前背景分类模块接收互相关特征图作为输入,输出每个特征点所对应候选框的分类置信度;所述3层卷积层都具有256个输入输出通道、3×3卷积核;所述可形变卷积的输入通道数为256,具有3×3卷积核,所述可形变卷积接收目标框估计模块输出的表征点位移参数作为可形变卷积中卷积核的位移参数。
较佳的,所述孪生网络的训练步骤包括:
进行训练数据的准备:训练数据选用已手工标注的目标检测图像数据集VID和YouTube-BoundingBoxes数据集,从每段视频中任意选取帧数相差不大于20帧的两帧图像,以其中一帧中目标为中心的矩形框为目标图像区域,假设该矩形框宽度为w,高度为h,将其缩放至127×127大小,其为目标模板的原始图像输入;另一帧中围绕目标中心裁剪出宽度为2w,高度为2h的目标搜索图像区域,随后将其缩放至255×255大小;每一对目标模板图像与目标搜索区域图像构成了1个训练数据;
之后在所述目标框估计模块中,为主干层和分支层分别建立关于预测的初始目标框位置和预测的最终目标框位置的损失函数,将主干层以每个特征点为中心输出9个表征点的坐标偏移量,假设特征点坐标为(x,y),表征点相对于特征点的偏移量就为(Δx i,Δy i)(i=1,2,…,9),得到每个表征点的坐标就为(x+Δx i,y+Δy i);根据9个表征点构造预测目标框,目标框的左上角为
Figure PCTCN2021133957-appb-000001
右下角坐标为
Figure PCTCN2021133957-appb-000002
在预测目标框中,将包含真实目标框中心点的预测目标框作为正实例,通过平滑L1损失函数计算正实例目标框与真实目标框左上角点和右下角点位置差,作为目标位置损失,为公式(1):
Figure PCTCN2021133957-appb-000003
上式中,t表示真值,v表示预测值;
所述分支层利用主干层输出的偏移量做可形变卷积操作,同样输出表征点相对于特征点的偏移量,并在表征点的基础上采用与主干层相同的方式构造预测目标框,在预测目标框中,选取与真实目标框交并比大于0.5的预测目标框作为正实例,通过平滑L1损失函数计算预测目标框与真实目标框在中 心点位置和长宽上的差值作为目标位置损失,为公式(2):
Figure PCTCN2021133957-appb-000004
上式中t表示真值,u表示预测值;
利用前背景分类模块估计每个特征点属于目标框的置信度分数,其损失函数为关于分类误差的函数,为公式(3):
L_cls = ||p*h - g||² + λ||h||²  (3);
上式中,p是前背景分类模块所接收的互相关特征图,h是卷积核,g是以真实目标框中心坐标为均值的二维高斯函数形式标签图;
根据式(1)、式(2)和式(3),可以得到总体损失函数为公式(4):
L=L cls1L loc12L loc2   (4);
其中λ 1、λ 2分别为正数值的正则参数,最后利用公式(4)的损失函数根据输入的训练数据进行反向传播,调整网络参数至损失函数收敛。
较佳的,所述孪生网络的目标跟踪过程步骤包括:
步骤1、在目标跟踪开始阶段,在第一帧视频中指定目标跟踪框,并以跟踪框内的图像作为目标图像;
步骤2、在后续跟踪过程中,在当前帧中围绕上一帧中的目标框为中心裁剪出高和宽为上一帧目标框高和宽2倍的图像区域作为当前帧中的目标搜索图像区域;
步骤3、基于训练完毕的孪生网络,将步骤1中得到的目标图像和步骤2中得到的目标搜索图像分别输入孪生网络的目标模板分支和目标搜索分支;
步骤4、以目标框估计模块分支输出的表征点构造预测目标框;
步骤5、前背景分类模块输出每个特征点的置信度,选取具有最大置信度特征点所对应的预测目标框作为最终目标框;
步骤6、重复步骤2到步骤5,直到完成所有视频帧上的目标跟踪任务。
综上所述,本发明主要具有以下有益效果:
本发明提出了一种基于表征点提取的视觉跟踪方法,使用9个表征点来描述目标物,并进一步在这9个点基础上作可形变卷积操作,以此提取出更为鲁棒的目标外观特征。相比于上述基于目标感知的跟踪方法,本发明根据表征点估计可形变卷积的偏移量参数,所提取的特征更具有针对性,更适合视觉跟踪任务要求。
附图说明
图1是本发明的整体网络结构图;
图2是本发明的跟踪方法流程图;
图3是本发明中滑雪运动员跟踪结果分析图。
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。
实施例1
参考图1至图3,一种基于目标物表征点估计的视觉跟踪方法,包括以下步骤:
S1、首帧中指定目标框作为目标模板;
S2、在下一帧中裁剪出目标搜索图像区域;
S3、将目标模板与搜索图像区域输入孪生网络;
S4、目标框估计模块输出预测目标框和前背景分类模块输出置信度图;
S5、取具有最大置信度目标框作为最终目标框,并重复S2-S5步骤。
其中,孪生网络的结构包括特征提取模块、互相关模块、目标框估计模 块、前背景分类模块;孪生网络的每一支卷积神经网络都是用于提取深度特征的主干网络模块,互相关模块计算目标模板特征与搜索区域特征之间的匹配似然度,目标框估计模块是在表征点估计结果的基础上输出目标框。
其中,孪生网络具有2支卷积神经网络构成的分支,每支卷积神经网络的主干网络都采用了残差神经网络ResNet-50,残差神经网络ResNet-50包含第1卷积块、第2卷积块、第3卷积块、第4卷积块、第5卷积块,在残差神经网络ResNet-50的第4卷积块、第5卷积块中舍去了下采样操作并采用空洞卷积来扩大感受野,其中第4卷积块中的空洞率设置为2,第5卷积块中的空洞率设置为4,第4卷积块和第5卷积块分别用于目标模板图像和目标搜索图像深度特征的提取。
其中,使用第3卷积块、第4卷积块和第5卷积块输出的特征结果进行融合克服多层卷积神经网络所提取的特征存在的差异性,对于每一个卷积块的输出,将目标模板的特征图视作为卷积核,并与搜索图像的特征图作卷积计算,将获得的互相关特征图作为后续前背景分类、目标框位置估计的输入;在计算互相关图时,将卷积的跨度参数按{(1,1),(1,2),(2,1)}设置,从而得到3组互相关特征图,对于每一组互相关特征图,由第3卷积块、第4卷积块和第5卷积块计算所得的3个互相关特征图在对应通道上做加权求和操作,最后得到互相关特征图。
其中,目标框估计模块接收互相关特征图,输出每个特征点所对应表征点的偏移量和目标区域表征点位置,目标框估计模块的网络结构包括了2个分支,其中一个分支层由4层256输入输出通道、3×3卷积核的卷积层,以及1层256输入通道、18个输出通道、1×1卷积核的卷积层构成的主干层;另一个分支层包含了由1层256输出通道、3×3卷积核构成的可形变卷积层,以及1层由256输入通道、18输出通道、1×1卷积核构成的卷积层。
其中,主干层接收互相关模块输出的互相关特征图,其输出每个表征点 的偏移量为表征点位移参数,由偏移量估计出每个特征点所属的初始目标框;分支层接收主干层第3层的特征图输出,其输出表征点进一步的偏移量;主干层输出的初始表征点位移量用来估计表征点初始位置,而后由表征点初始位置加上分支层输出的表征点偏移量可以得到表征点最终的位置结果,进一步在表征点的基础上直接得到目标框。
其中,前背景分类模块由3层卷积层和1层可形变卷积层依次连接构成,前背景分类模块接收互相关特征图作为输入,输出每个特征点所对应候选框的分类置信度;3层卷积层都具有256个输入输出通道、3×3卷积核;可形变卷积的输入通道数为256,具有3×3卷积核,可形变卷积接收目标框估计模块输出的表征点位移参数作为可形变卷积中卷积核的位移参数。
其中,孪生网络的训练步骤包括:
进行训练数据的准备:训练数据选用已手工标注的目标检测图像数据集VID和YouTube-BoundingBoxes数据集,从每段视频中任意选取帧数相差不大于20帧的两帧图像,以其中一帧中目标为中心的矩形框为目标图像区域,假设该矩形框宽度为w,高度为h,将其缩放至127×127大小,其为目标模板的原始图像输入;另一帧中围绕目标中心裁剪出宽度为2w,高度为2h的目标搜索图像区域,随后将其缩放至255×255大小;每一对目标模板图像与目标搜索区域图像构成了1个训练数据;
之后在目标框估计模块中,为主干层和分支层分别建立关于预测的初始目标框位置和预测的最终目标框位置的损失函数,将主干层以每个特征点为中心输出9个表征点的坐标偏移量,假设特征点坐标为(x,y),表征点相对于特征点的偏移量就为(Δx i,Δy i)(i=1,2,…,9),得到每个表征点的坐标就为(x+Δx i,y+Δy i);根据9个表征点构造预测目标框,目标框的左上角为
Figure PCTCN2021133957-appb-000005
右下角坐标为
Figure PCTCN2021133957-appb-000006
在预测目标框中,将包含真实目标框中心点的预测目标框作为正实例,通过平滑L1损失函数计算正实例目标框与真实目标框左上角点和右下角点位置差,作为目标位置损失,为公式(1):
Figure PCTCN2021133957-appb-000007
上式中,t表示真值,v表示预测值;
分支层利用主干层输出的偏移量做可形变卷积操作,同样输出表征点相对于特征点的偏移量,并在表征点的基础上采用与主干层相同的方式构造预测目标框,在预测目标框中,选取与真实目标框交并比大于0.5的预测目标框作为正实例,通过平滑L1损失函数计算预测目标框与真实目标框在中心点位置和长宽上的差值作为目标位置损失,为公式(2):
Figure PCTCN2021133957-appb-000008
上式中t表示真值,u表示预测值;
利用前背景分类模块估计每个特征点属于目标框的置信度分数,其损失函数为关于分类误差的函数,为公式(3):
L_cls = ||p*h - g||² + λ||h||²  (3);
上式中,p是前背景分类模块所接收的互相关特征图,h是卷积核,g是以真实目标框中心坐标为均值的二维高斯函数形式标签图;
根据式(1)、式(2)和式(3),可以得到总体损失函数为公式(4):
L=L cls1L loc12L loc2    (4);
其中λ 1、λ 2分别为正数值的正则参数,最后利用公式(4)的损失函数根据输入的训练数据进行反向传播,调整网络参数至损失函数收敛。
其中,孪生网络的目标跟踪过程步骤包括:
步骤1、在目标跟踪开始阶段,在第一帧视频中指定目标跟踪框,并以跟踪框内的图像作为目标图像;
步骤2、在后续跟踪过程中,在当前帧中围绕上一帧中的目标框为中心裁剪出高和宽为上一帧目标框高和宽2倍的图像区域作为当前帧中的目标搜索图像区域;
步骤3、基于训练完毕的孪生网络,将步骤1中得到的目标图像和步骤2中得到的目标搜索图像分别输入孪生网络的目标模板分支和目标搜索分支;
步骤4、以目标框估计模块分支输出的表征点构造预测目标框;
步骤5、前背景分类模块输出每个特征点的置信度,选取具有最大置信度特征点所对应的预测目标框作为最终目标框;
步骤6、重复步骤2到步骤5,直到完成所有视频帧上的目标跟踪任务。
其中,本发明提出了一种基于表征点提取的视觉跟踪方法,使用9个表征点来描述目标物,并进一步在这9个点基础上作可形变卷积操作,以此提取出更为鲁棒的目标外观特征。相比于上述基于目标感知的跟踪方法,本发明根据表征点估计可形变卷积的偏移量参数,所提取的特征更具有针对性,更适合视觉跟踪任务要求。
尽管已经示出和描述了本发明的实施例,对于本领域的普通技术人员而言,可以理解在不脱离本发明的原理和精神的情况下可以对这些实施例进行多种变化、修改、替换和变型,本发明的范围由所附权利要求及其等同物限定。

Claims (9)

  1. 一种基于目标物表征点估计的视觉跟踪方法,其特征在于:包括以下步骤:
    S1、首帧中指定目标框作为目标模板;
    S2、在下一帧中裁剪出目标搜索图像区域;
    S3、将目标模板与搜索图像区域输入孪生网络;
    S4、目标框估计模块输出预测目标框和前背景分类模块输出置信度图;
    S5、取具有最大置信度目标框作为最终目标框,并重复S2-S5步骤。
  2. 根据权利要求1所述的一种基于目标物表征点估计的视觉跟踪方法,其特征在于:所述孪生网络的结构包括特征提取模块、互相关模块、目标框估计模块、前背景分类模块;所述孪生网络的每一支卷积神经网络都是用于提取深度特征的主干网络模块,所述互相关模块计算目标模板特征与搜索区域特征之间的匹配似然度,所述目标框估计模块是在表征点估计结果的基础上输出目标框。
  3. 根据权利要求1所述的一种基于目标物表征点估计的视觉跟踪方法,其特征在于:所述孪生网络具有2支卷积神经网络构成的分支,每支所述卷积神经网络的主干网络都采用了残差神经网络ResNet-50,残差神经网络ResNet-50包含第1卷积块、第2卷积块、第3卷积块、第4卷积块、第5卷积块,在残差神经网络ResNet-50的第4卷积块、第5卷积块中舍去了下采样操作并采用空洞卷积来扩大感受野,其中第4卷积块中的空洞率设置为2,第5卷积块中的空洞率设置为4,第4卷积块和所述第5卷积块分别用于目标模板图像和目标搜索图像深度特征的提取。
  4. 根据权利要求3所述的一种基于目标物表征点估计的视觉跟踪方法,其特征在于:使用所述第3卷积块、第4卷积块和第5卷积块输出的特征结果进行融合克服多层卷积神经网络所提取的特征存在的差异性,对于每一个卷积块的输出,将目标模板的特征图视作为卷积核,并与搜索图像的特征图 作卷积计算,将获得的互相关特征图作为后续前背景分类、目标框位置估计的输入;在计算互相关图时,将卷积的跨度参数按{(1,1),(1,2),(2,1)}设置,从而得到3组互相关特征图,对于每一组互相关特征图,由第3卷积块、第4卷积块和第5卷积块计算所得的3个互相关特征图在对应通道上做加权求和操作,最后得到互相关特征图。
  5. 根据权利要求2所述的一种基于目标物表征点估计的视觉跟踪方法,其特征在于:所述目标框估计模块接收互相关特征图,输出每个特征点所对应表征点的偏移量和目标区域表征点位置,所述目标框估计模块的网络结构包括了2个分支,其中一个分支层由4层256输入输出通道、3×3卷积核的卷积层,以及1层256输入通道、18个输出通道、1×1卷积核的卷积层构成的主干层;另一个分支层包含了由1层256输出通道、3×3卷积核构成的可形变卷积层,以及1层由256输入通道、18输出通道、1×1卷积核构成的卷积层。
  6. 根据权利要求5所述的一种基于目标物表征点估计的视觉跟踪方法,其特征在于:所述主干层接收互相关模块输出的互相关特征图,其输出每个表征点的偏移量为表征点位移参数,由偏移量估计出每个特征点所属的初始目标框;所述分支层接收主干层第3层的特征图输出,其输出表征点进一步的偏移量;所述主干层输出的初始表征点位移量用来估计表征点初始位置,而后由表征点初始位置加上分支层输出的表征点偏移量可以得到表征点最终的位置结果,进一步在表征点的基础上直接得到目标框。
  7. 根据权利要求2所述的一种基于目标物表征点估计的视觉跟踪方法,其特征在于:所述前背景分类模块由3层卷积层和1层可形变卷积层依次连接构成,所述前背景分类模块接收互相关特征图作为输入,输出每个特征点所对应候选框的分类置信度;所述3层卷积层都具有256个输入输出通道、3×3卷积核;所述可形变卷积的输入通道数为256,具有3×3卷积核,所述 可形变卷积接收目标框估计模块输出的表征点位移参数作为可形变卷积中卷积核的位移参数。
  8. 根据权利要求6所述的一种基于目标物表征点估计的视觉跟踪方法,其特征在于:所述孪生网络的训练步骤包括:
    进行训练数据的准备:训练数据选用已手工标注的目标检测图像数据集VID和YouTube-BoundingBoxes数据集,从每段视频中任意选取帧数相差不大于20帧的两帧图像,以其中一帧中目标为中心的矩形框为目标图像区域,假设该矩形框宽度为w,高度为h,将其缩放至127×127大小,其为目标模板的原始图像输入;另一帧中围绕目标中心裁剪出宽度为2w,高度为2h的目标搜索图像区域,随后将其缩放至255×255大小;每一对目标模板图像与目标搜索区域图像构成了1个训练数据;
    之后在所述目标框估计模块中,为主干层和分支层分别建立关于预测的初始目标框位置和预测的最终目标框位置的损失函数,将主干层以每个特征点为中心输出9个表征点的坐标偏移量,假设特征点坐标为(x,y),表征点相对于特征点的偏移量就为(Δx i,Δy i)(i=1,2,…,9),得到每个表征点的坐标就为(x+Δx i,y+Δy i);根据9个表征点构造预测目标框,目标框的左上角为
    Figure PCTCN2021133957-appb-100001
    右下角坐标为
    Figure PCTCN2021133957-appb-100002
    在预测目标框中,将包含真实目标框中心点的预测目标框作为正实例,通过平滑L1损失函数计算正实例目标框与真实目标框左上角点和右下角点位置差,作为目标位置损失,为公式(1):
    Figure PCTCN2021133957-appb-100003
    上式中,t表示真值,v表示预测值;
    所述分支层利用主干层输出的偏移量做可形变卷积操作,同样输出表征 点相对于特征点的偏移量,并在表征点的基础上采用与主干层相同的方式构造预测目标框,在预测目标框中,选取与真实目标框交并比大于0.5的预测目标框作为正实例,通过平滑L1损失函数计算预测目标框与真实目标框在中心点位置和长宽上的差值作为目标位置损失,为公式(2):
    Figure PCTCN2021133957-appb-100004
    上式中t表示真值,u表示预测值;
    利用前背景分类模块估计每个特征点属于目标框的置信度分数,其损失函数为关于分类误差的函数,为公式(3):
L_cls = ||p*h - g||² + λ||h||²  (3);
    上式中,p是前背景分类模块所接收的互相关特征图,h是卷积核,g是以真实目标框中心坐标为均值的二维高斯函数形式标签图;
    根据式(1)、式(2)和式(3),可以得到总体损失函数为公式(4):
    L=L cls1L loc12L loc2(4);
    其中λ 1、λ 2分别为正数值的正则参数,最后利用公式(4)的损失函数根据输入的训练数据进行反向传播,调整网络参数至损失函数收敛。
  9. 根据权利要求1所述的一种基于目标物表征点估计的视觉跟踪方法,其特征在于:所述孪生网络的目标跟踪过程步骤包括:
    步骤1、在目标跟踪开始阶段,在第一帧视频中指定目标跟踪框,并以跟踪框内的图像作为目标图像;
    步骤2、在后续跟踪过程中,在当前帧中围绕上一帧中的目标框为中心裁剪出高和宽为上一帧目标框高和宽2倍的图像区域作为当前帧中的目标搜索图像区域;
    步骤3、基于训练完毕的孪生网络,将步骤1中得到的目标图像和步骤2 中得到的目标搜索图像分别输入孪生网络的目标模板分支和目标搜索分支;
    步骤4、以目标框估计模块分支输出的表征点构造预测目标框;
    步骤5、前背景分类模块输出每个特征点的置信度,选取具有最大置信度特征点所对应的预测目标框作为最终目标框;
    步骤6、重复步骤2到步骤5,直到完成所有视频帧上的目标跟踪任务。
PCT/CN2021/133957 2021-06-29 2021-11-29 一种基于目标物表征点估计的视觉跟踪方法 WO2023273136A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110723916.X 2021-06-29
CN202110723916.XA CN113344976B (zh) 2021-06-29 2021-06-29 一种基于目标物表征点估计的视觉跟踪方法

Publications (1)

Publication Number Publication Date
WO2023273136A1 true WO2023273136A1 (zh) 2023-01-05

Family

ID=77481178

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/133957 WO2023273136A1 (zh) 2021-06-29 2021-11-29 一种基于目标物表征点估计的视觉跟踪方法

Country Status (2)

Country Link
CN (1) CN113344976B (zh)
WO (1) WO2023273136A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116152298A (zh) * 2023-04-17 2023-05-23 中国科学技术大学 一种基于自适应局部挖掘的目标跟踪方法
CN116403006A (zh) * 2023-06-07 2023-07-07 南京军拓信息科技有限公司 实时视觉目标跟踪方法、装置、及存储介质
CN116572264A (zh) * 2023-05-22 2023-08-11 中铁九局集团电务工程有限公司 一种基于轻量模型的软体机械臂自由眼系统目标追踪方法
CN116645399A (zh) * 2023-07-19 2023-08-25 山东大学 基于注意力机制的残差网络目标跟踪方法及系统
CN116665133A (zh) * 2023-07-24 2023-08-29 山东科技大学 基于三元组网络的安全帽检测跟踪方法、设备及存储介质
CN117252904A (zh) * 2023-11-15 2023-12-19 南昌工程学院 基于长程空间感知与通道增强的目标跟踪方法与系统

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113344976B (zh) * 2021-06-29 2024-01-23 常州工学院 一种基于目标物表征点估计的视觉跟踪方法
CN113870330B (zh) * 2021-09-30 2023-05-12 四川大学 基于特定标签和损失函数的孪生视觉跟踪方法
CN114596338B (zh) * 2022-05-09 2022-08-16 四川大学 一种考虑时序关系的孪生网络目标跟踪方法
CN114723752A (zh) * 2022-06-07 2022-07-08 成都新西旺自动化科技有限公司 一种融合目标检测和模板匹配的高精度对位方法及系统
CN115588030B (zh) * 2022-09-27 2023-09-12 湖北工业大学 基于孪生网络的视觉目标跟踪方法及设备

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10664722B1 (en) * 2016-10-05 2020-05-26 Digimarc Corporation Image processing arrangements
CN111639551A (zh) * 2020-05-12 2020-09-08 华中科技大学 基于孪生网络和长短期线索的在线多目标跟踪方法和系统
CN113344976A (zh) * 2021-06-29 2021-09-03 常州工学院 一种基于目标物表征点估计的视觉跟踪方法

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110807793B (zh) * 2019-09-29 2022-04-22 南京大学 一种基于孪生网络的目标跟踪方法
CN111462175B (zh) * 2020-03-11 2023-02-10 华南理工大学 时空卷积孪生匹配网络目标跟踪方法、装置、介质及设备
CN111429482A (zh) * 2020-03-19 2020-07-17 上海眼控科技股份有限公司 目标跟踪方法、装置、计算机设备和存储介质
CN112365523A (zh) * 2020-11-05 2021-02-12 常州工学院 基于无锚点孪生网络关键点检测的目标跟踪方法及装置

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10664722B1 (en) * 2016-10-05 2020-05-26 Digimarc Corporation Image processing arrangements
CN111639551A (zh) * 2020-05-12 2020-09-08 华中科技大学 基于孪生网络和长短期线索的在线多目标跟踪方法和系统
CN113344976A (zh) * 2021-06-29 2021-09-03 常州工学院 一种基于目标物表征点估计的视觉跟踪方法

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YANG ZE; LIU SHAOHUI; HU HAN; WANG LIWEI; LIN STEPHEN: "RepPoints: Point Set Representation for Object Detection", 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), IEEE, 27 October 2019 (2019-10-27), pages 9656 - 9665, XP033723350, DOI: 10.1109/ICCV.2019.00975 *
ZHIPENG ZHANG; HOUWEN PENG; JIANLONG FU; BING LI; WEIMING HU: "Ocean: Object-aware Anchor-free Tracking", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 9 July 2020 (2020-07-09), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081712203 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116152298A (zh) * 2023-04-17 2023-05-23 中国科学技术大学 一种基于自适应局部挖掘的目标跟踪方法
CN116152298B (zh) * 2023-04-17 2023-08-29 中国科学技术大学 一种基于自适应局部挖掘的目标跟踪方法
CN116572264A (zh) * 2023-05-22 2023-08-11 中铁九局集团电务工程有限公司 一种基于轻量模型的软体机械臂自由眼系统目标追踪方法
CN116403006A (zh) * 2023-06-07 2023-07-07 南京军拓信息科技有限公司 实时视觉目标跟踪方法、装置、及存储介质
CN116403006B (zh) * 2023-06-07 2023-08-29 南京军拓信息科技有限公司 实时视觉目标跟踪方法、装置、及存储介质
CN116645399A (zh) * 2023-07-19 2023-08-25 山东大学 基于注意力机制的残差网络目标跟踪方法及系统
CN116645399B (zh) * 2023-07-19 2023-10-13 山东大学 基于注意力机制的残差网络目标跟踪方法及系统
CN116665133A (zh) * 2023-07-24 2023-08-29 山东科技大学 基于三元组网络的安全帽检测跟踪方法、设备及存储介质
CN116665133B (zh) * 2023-07-24 2023-10-13 山东科技大学 基于三元组网络的安全帽检测跟踪方法、设备及存储介质
CN117252904A (zh) * 2023-11-15 2023-12-19 南昌工程学院 基于长程空间感知与通道增强的目标跟踪方法与系统
CN117252904B (zh) * 2023-11-15 2024-02-09 南昌工程学院 基于长程空间感知与通道增强的目标跟踪方法与系统

Also Published As

Publication number Publication date
CN113344976B (zh) 2024-01-23
CN113344976A (zh) 2021-09-03

Similar Documents

Publication Publication Date Title
WO2023273136A1 (zh) 一种基于目标物表征点估计的视觉跟踪方法
CN109191491B (zh) 基于多层特征融合的全卷积孪生网络的目标跟踪方法及系统
CN108682017B (zh) 基于Node2Vec算法的超像素图像边缘检测方法
CN110210551A (zh) 一种基于自适应主体敏感的视觉目标跟踪方法
CN110853026B (zh) 一种融合深度学习与区域分割的遥感影像变化检测方法
CN112434655B (zh) 一种基于自适应置信度图卷积网络的步态识别方法
CN111161317A (zh) 一种基于多重网络的单目标跟踪方法
CN102075686B (zh) 一种鲁棒的实时在线摄像机跟踪方法
CN111260661B (zh) 一种基于神经网络技术的视觉语义slam系统及方法
CN109902565B (zh) 多特征融合的人体行为识别方法
CN113706581B (zh) 基于残差通道注意与多层次分类回归的目标跟踪方法
CN114187665B (zh) 一种基于人体骨架热图的多人步态识别方法
CN110895683B (zh) 一种基于Kinect的单视点手势姿势识别方法
CN112365523A (zh) 基于无锚点孪生网络关键点检测的目标跟踪方法及装置
CN106780450A (zh) 一种基于低秩多尺度融合的图像显著性检测方法
CN107424161A (zh) 一种由粗至精的室内场景图像布局估计方法
CN113963032A (zh) 一种融合目标重识别的孪生网络结构目标跟踪方法
CN106204637A (zh) 光流计算方法
CN107944437A (zh) 一种基于神经网络和积分图像的人脸定位方法
CN114036969A (zh) 一种多视角情况下的3d人体动作识别算法
CN112396036A (zh) 一种结合空间变换网络和多尺度特征提取的遮挡行人重识别方法
CN111291687A (zh) 一种3d人体动作标准性判识的方法
CN116311518A (zh) 一种基于人体交互意图信息的层级人物交互检测方法
CN114862904B (zh) 一种水下机器人的孪生网络目标连续跟踪方法
CN116051601A (zh) 一种深度时空关联的视频目标跟踪方法及系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21948062

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE