Target tracking method and device fusing optical flow information and a Siamese framework
Technical Field
The present invention relates to the field of image recognition, and in particular to a target tracking method and device that fuse optical flow information with a Siamese framework.
Background Art
With the rapid development of computer vision, single-target tracking has attracted increasing public attention. Tracking algorithms have evolved from generative-model algorithms based on Kalman filters, particle filters, and feature-point matching to today's discriminative-model algorithms based on the correlation-filter framework and the Siamese (twin) framework, and their accuracy and computation speed continue to improve.
Generative-model algorithms based on feature-point matching have a simple model structure and require no training, but their accuracy is low and the feature points disappear under occlusion. Fully convolutional network algorithms based on the Siamese framework are fast, but they consider only the appearance features of the image and cannot track objects with complex backgrounds or violent motion.
Summary of the Invention
To solve the above technical problems, the present invention proposes a target tracking method and device that fuse optical flow information with a Siamese framework, addressing the prior-art technical problems that generative-model algorithms based on feature-point matching have low accuracy and that fully convolutional network algorithms based on the Siamese framework cannot track objects with complex backgrounds or violent motion.
According to a first aspect of the present invention, a target tracking method fusing optical flow information and a Siamese framework is provided, including:
S101: Obtain the current frame, which is the N-th frame (N > 3), and the three frames preceding it, namely the (N-3)-th, (N-2)-th, and (N-1)-th frames. Compute the optical flow between each of the (N-3)-th, (N-2)-th, and (N-1)-th frames and the current N-th frame using the TVNet optical flow network, obtaining Flow1, Flow2, and Flow3. Perform a crop operation on Flow1, Flow2, and Flow3 to obtain 22×22 optical-flow vector maps P1, P2, and P3. Input the current frame into the feature network to obtain a 22×22 current-frame feature map F_N. Combine F_N with each of P1, P2, and P3, and apply a warp operation to each combined result to obtain the warped feature maps F_1, F_2, and F_3;
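As an illustrative sketch (not part of the claimed method), the crop and warp operations of S101 can be written as follows, assuming the flow fields and the feature map are NumPy arrays and that the warp uses bilinear sampling; all function names are hypothetical:

```python
import numpy as np

def crop_center(flow, size=22):
    """Crop a flow field of shape (H, W, 2) to a centered size x size window."""
    h, w, _ = flow.shape
    top, left = (h - size) // 2, (w - size) // 2
    return flow[top:top + size, left:left + size, :]

def warp_feature(feat, flow):
    """Bilinearly warp a feature map (H, W, C) with a flow field (H, W, 2).

    Each output location (x, y) samples feat at (x + u, y + v), where
    (u, v) = flow[y, x]; sample positions are clamped to the map border.
    """
    h, w, _ = feat.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = (sx - x0)[..., None], (sy - y0)[..., None]
    # Blend the four neighboring feature vectors per output position.
    return ((1 - wy) * ((1 - wx) * feat[y0, x0] + wx * feat[y0, x1])
            + wy * ((1 - wx) * feat[y1, x0] + wx * feat[y1, x1]))
```

A 22×22 crop of each flow map followed by one `warp_feature` call per preceding frame produces the warped feature maps F_1, F_2, F_3.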
S102: Input the warped feature maps F_1, F_2, F_3 and the current-frame feature map F_N into the temporal scoring model as candidate detection frames to obtain the feature weight of each candidate detection frame, and multiply the feature weights with the optical-flow-fused candidate detection frames according to formula (1) to obtain the final detection frame;
formula (1) is
f̄_i = Σ_j w_{j→i} · f_{j→i}    (1)
with the sum taken over the candidate detection frames, where i denotes the index of the current frame, I_i denotes the current (i-th) frame, and I_j denotes a frame preceding the current frame I_i, such as the j-th frame, with j ∈ {i-T, ..., i-2, i-1} and T = 3, i.e., the three frames preceding the current frame; f̄_i is the final detection frame obtained by fusing the optical-flow information of the other frames into the current frame; w_{j→i} denotes the feature weight of a candidate detection frame computed and output by the temporal scoring model; f_{j→i} is obtained by mapping the motion information of the j-th frame to the i-th frame through the optical flow network and then applying a warp operation to the resulting optical-flow map and the image of the j-th frame;
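The weighted fusion of formula (1) can be sketched as follows, under the assumption that each candidate detection frame receives a single scalar weight from the scoring model; the names are illustrative:

```python
import numpy as np

def aggregate(warped_feats, weights):
    """Weighted fusion of candidate detection frames, in the spirit of formula (1).

    warped_feats: list of candidate feature maps f_{j->i}, each of shape (H, W, C)
    weights:      list of scalar weights w_{j->i} from the temporal scoring model
    Returns the fused final detection frame, a (H, W, C) array.
    """
    assert len(warped_feats) == len(weights)
    return sum(w * f for w, f in zip(weights, warped_feats))
```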
Mapping the motion information of the j-th frame to the i-th frame through the optical flow network is defined as
f_{j→i} = W(f_j, M_{i→j}) = W(f_j, F(I_i, I_j))
where F(I_i, I_j) is the optical-flow computation performed on I_i and I_j by the optical flow network, whose result maps the motion information of the j-th frame to the i-th frame; f_j is the feature map of the j-th frame; and W(·,·) fuses the result of the optical-flow computation with frame I_j and performs a warp operation on the fused information, applying the linear deformation equation at each position of every channel's feature map;
wherein the input of the temporal scoring model is the unscored warped feature maps F_1, F_2, F_3 of the respective time steps together with the current-frame feature map F_N, and its output is the weight value of each candidate detection frame;
the temporal scoring model has a pooling layer that can perform a global average pooling operation and a global maximum pooling operation; through these two operations, the amount of object information contained in each candidate detection frame is scored, yielding intermediate matrices after the operations;
the global average pooling operation is
G_{S-GA}(q_T) = (1/(H·W)) Σ_{q_x=1..H} Σ_{q_y=1..W} q_T(q_x, q_y)
where G_{S-GA}(...) denotes the global average pooling process; q_T denotes the T candidate detection frames; q_x and q_y denote pixel positions in the feature map; H is the height of the feature map before the global average pooling operation; and W is the width of the feature map before the global average pooling operation;
the global maximum pooling operation is
G_{S-GM}(q_T) = Max(q_T(q_x, q_y))
where G_{S-GM}(...) denotes the global maximum pooling process;
the global average pooling operation outputs a T×1 vector, forming the global-average-pooling intermediate matrix, and the global maximum pooling operation likewise outputs a T×1 vector, forming the global-maximum-pooling intermediate matrix;
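The two pooling operations can be sketched as follows, assuming the T candidate detection frames are stacked into a (T, H, W) array; each function returns the T×1 intermediate matrix described above (names illustrative):

```python
import numpy as np

def global_avg_pool(q):
    """Global average pooling over each of T candidate frames: (T, H, W) -> (T, 1)."""
    T, H, W = q.shape
    return q.reshape(T, H * W).mean(axis=1, keepdims=True)

def global_max_pool(q):
    """Global maximum pooling over each of T candidate frames: (T, H, W) -> (T, 1)."""
    T, H, W = q.shape
    return q.reshape(T, H * W).max(axis=1, keepdims=True)
```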
the global-average-pooling intermediate matrix and the global-maximum-pooling intermediate matrix are input to a shared network layer, which scores the relevance of each candidate frame to the current frame; the shared network layer, which implements a convolution operation with parameters obtained empirically or by training, produces one weight matrix for the global average pooling branch and one for the global maximum pooling branch; the two weight matrices are then added element-wise to obtain a weight feature vector, which is used as the input of the activation function Relu:
Relu(x) = x, if x ≥ 0; Relu(x) = αx, if x < 0
where x is the input weight feature vector and α is a coefficient;
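Under the stated coefficient α, this activation is a leaky ReLU; a minimal sketch (setting α = 0 recovers the standard ReLU):

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    """Relu(x) = x for x >= 0, alpha * x otherwise; alpha is the coefficient above."""
    return np.where(x >= 0, x, alpha * x)
```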
the temporal scoring model is a convolutional neural network model trained according to a loss function.
Further, the temporal scoring model is a convolutional neural network model trained according to a loss function, the loss function being:
l(y, v) = log(1 + exp(-y·v))
where v denotes the true value of each point of the candidate response map of a training image, and y ∈ {+1, -1} denotes the label of the standard tracking box; the model learns by minimizing this loss function, and when the loss function stabilizes, training of the temporal scoring model is complete and its coefficients are obtained; the trained temporal scoring model then computes the weight values of the candidate detection frames, yielding the temporal weights of the candidate detection frames.
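The loss l(y, v) = log(1 + exp(-y·v)) can be evaluated as follows; np.logaddexp is used so the computation stays numerically stable for large margins:

```python
import numpy as np

def logistic_loss(y, v):
    """l(y, v) = log(1 + exp(-y * v)), computed as logaddexp(0, -y*v) for stability."""
    return np.logaddexp(0.0, -y * v)
```

A correct, confident prediction (y and v of the same sign, large |v|) drives the loss toward zero, while a confident mistake grows roughly linearly in |v|.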
Further, to better extract the image features of the candidate detection frames, the convolution filters in the shared network layer use deformable convolution: a learnable offset parameter Δp_n is added to the region of action of the conventional convolution operation.
According to a second aspect of the present invention, a target tracking device fusing optical flow information and a Siamese framework is provided, including:
a feature acquisition module, configured to obtain the current frame, which is the N-th frame (N > 3), and the three frames preceding it, namely the (N-3)-th, (N-2)-th, and (N-1)-th frames; compute the optical flow between each of the (N-3)-th, (N-2)-th, and (N-1)-th frames and the current N-th frame using the TVNet optical flow network, obtaining Flow1, Flow2, and Flow3; perform a crop operation on Flow1, Flow2, and Flow3 to obtain 22×22 optical-flow vector maps P1, P2, and P3; input the current frame into the feature network to obtain a 22×22 current-frame feature map F_N; and combine F_N with each of P1, P2, and P3, applying a warp operation to each combined result to obtain the warped feature maps F_1, F_2, and F_3;
a weight calculation module, configured to input the warped feature maps F_1, F_2, F_3 and the current-frame feature map F_N into the temporal scoring model as candidate detection frames to obtain the feature weight of each candidate detection frame, and to multiply the feature weights with the optical-flow-fused candidate detection frames according to formula (1) to obtain the final detection frame;
formula (1) is
f̄_i = Σ_j w_{j→i} · f_{j→i}    (1)
with the sum taken over the candidate detection frames, where i denotes the index of the current frame, I_i denotes the current (i-th) frame, and I_j denotes a frame preceding the current frame I_i, such as the j-th frame, with j ∈ {i-T, ..., i-2, i-1} and T = 3, i.e., the three frames preceding the current frame; f̄_i is the final detection frame obtained by fusing the optical-flow information of the other frames into the current frame; w_{j→i} denotes the feature weight of a candidate detection frame computed and output by the temporal scoring model; f_{j→i} is obtained by mapping the motion information of the j-th frame to the i-th frame through the optical flow network and then applying a warp operation to the resulting optical-flow map and the image of the j-th frame;
Mapping the motion information of the j-th frame to the i-th frame through the optical flow network is defined as
f_{j→i} = W(f_j, M_{i→j}) = W(f_j, F(I_i, I_j))
where F(I_i, I_j) is the optical-flow computation performed on I_i and I_j by the optical flow network, whose result maps the motion information of the j-th frame to the i-th frame; f_j is the feature map of the j-th frame; and W(·,·) fuses the result of the optical-flow computation with frame I_j and performs a warp operation on the fused information, applying the linear deformation equation at each position of every channel's feature map;
wherein the input of the temporal scoring model is the unscored warped feature maps F_1, F_2, F_3 of the respective time steps together with the current-frame feature map F_N, and its output is the weight value of each candidate detection frame;
the temporal scoring model has a pooling layer that can perform a global average pooling operation and a global maximum pooling operation; through these two operations, the amount of object information contained in each candidate detection frame is scored, yielding intermediate matrices after the operations;
the global average pooling operation is
G_{S-GA}(q_T) = (1/(H·W)) Σ_{q_x=1..H} Σ_{q_y=1..W} q_T(q_x, q_y)
where G_{S-GA}(...) denotes the global average pooling process; q_T denotes the T candidate detection frames; q_x and q_y denote pixel positions in the feature map; H is the height of the feature map before the global average pooling operation; and W is the width of the feature map before the global average pooling operation;
the global maximum pooling operation is
G_{S-GM}(q_T) = Max(q_T(q_x, q_y))
where G_{S-GM}(...) denotes the global maximum pooling process;
the global average pooling operation outputs a T×1 vector, forming the global-average-pooling intermediate matrix, and the global maximum pooling operation likewise outputs a T×1 vector, forming the global-maximum-pooling intermediate matrix;
the global-average-pooling intermediate matrix and the global-maximum-pooling intermediate matrix are input to a shared network layer, which scores the relevance of each candidate frame to the current frame; the shared network layer, which implements a convolution operation with parameters obtained empirically or by training, produces one weight matrix for the global average pooling branch and one for the global maximum pooling branch; the two weight matrices are then added element-wise to obtain a weight feature vector, which is used as the input of the activation function Relu:
Relu(x) = x, if x ≥ 0; Relu(x) = αx, if x < 0
where x is the input weight feature vector and α is a coefficient;
the temporal scoring model is a convolutional neural network model trained according to a loss function.
Further, the temporal scoring model is a convolutional neural network model trained according to a loss function, the loss function being:
l(y, v) = log(1 + exp(-y·v))
where v denotes the true value of each point of the candidate response map of a training image, and y ∈ {+1, -1} denotes the label of the standard tracking box; the model learns by minimizing this loss function, and when the loss function stabilizes, training of the temporal scoring model is complete and its coefficients are obtained; the trained temporal scoring model then computes the weight values of the candidate detection frames, yielding the temporal weights of the candidate detection frames.
Further, to better extract the image features of the candidate detection frames, the convolution filters in the shared network layer use deformable convolution: a learnable offset parameter Δp_n is added to the region of action of the conventional convolution operation.
According to a third aspect of the present invention, a target tracking system fusing optical flow information and a Siamese framework is provided, including:
a processor, configured to execute a plurality of instructions; and
a memory, configured to store a plurality of instructions;
wherein the plurality of instructions are stored by the memory and loaded and executed by the processor to perform the target tracking method fusing optical flow information and a Siamese framework described above.
According to a fourth aspect of the present invention, a computer-readable storage medium is provided, the storage medium storing a plurality of instructions; the plurality of instructions are loaded by a processor to execute the target tracking method fusing optical flow information and a Siamese framework described above.
According to the above solution of the present invention, target tracking is performed based on feature maps that integrate optical flow information, combined with a Siamese framework; the method achieves high accuracy and fast computation and can track objects with complex backgrounds and violent motion.
The above description is only an overview of the technical solution of the present invention. To make the technical means of the present invention clearer and implementable according to the contents of the specification, preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Brief Description of the Drawings
The drawings, which constitute a part of the present invention, are provided for a further understanding of the present invention. In the drawings:
Fig. 1 is a structural diagram of a target tracking system fusing optical flow information and a Siamese framework according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the temporal scoring model according to an embodiment of the present invention;
Fig. 3A is a schematic diagram of a conventional 3×3 convolution calculation;
Figs. 3B-3C are schematic diagrams of deformable convolution calculation;
Fig. 4 is a flowchart of the target tracking method fusing optical flow information and a Siamese framework proposed by the present invention;
Fig. 5 is a block diagram of the target tracking device fusing optical flow information and a Siamese framework proposed by the present invention.
Detailed Description of the Embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to specific embodiments and the corresponding drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative work fall within the protection scope of the present invention.
The structure of the target tracking system fusing optical flow information and a Siamese framework of the present invention is first described with reference to Fig. 1, which shows a structural diagram of such a system according to an embodiment of the present invention.
Obtain the current frame, which is the N-th frame (N > 3), and the three frames preceding it, namely the (N-3)-th, (N-2)-th, and (N-1)-th frames. Compute the optical flow between each of the (N-3)-th, (N-2)-th, and (N-1)-th frames and the current frame (the N-th frame) using the TVNet optical flow network (for the TVNet optical flow network, see VALMADRE J, BERTINETTO L, HENRIQUES J, et al. End-to-end representation learning for correlation filter based tracking[C]. Honolulu, Hawaii, USA. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 2805-2813), obtaining Flow1, Flow2, and Flow3. Perform a crop operation on Flow1, Flow2, and Flow3 to obtain 22×22 optical-flow vector maps P1, P2, and P3. Construct a feature network based on AlexNet; the feature network is built by removing the fully connected layers from AlexNet. Input the current frame into the feature network to obtain a 22×22 current-frame feature map F_N. Combine F_N with each of P1, P2, and P3, and apply a warp operation to each combined result to obtain the warped feature maps F_1, F_2, and F_3. Finally, input the warped feature maps F_1, F_2, F_3 and the current-frame feature map F_N into the temporal scoring model as candidate detection frames to obtain the feature weight of each candidate detection frame, and multiply the feature weights with the optical-flow-fused candidate detection frames according to formula (1) to obtain the final detection frame.
Here, formula (1) is
f̄_i = Σ_j w_{j→i} · f_{j→i}    (1)
with the sum taken over the candidate detection frames, where i denotes the index of the current frame, I_i denotes the current (i-th) frame, and I_j denotes a frame preceding the current frame I_i, such as the j-th frame, with j ∈ {i-T, ..., i-2, i-1}; in this embodiment T = 3, i.e., the three frames preceding the current frame; f̄_i is the final detection frame obtained by fusing the optical-flow information of the other frames into the current frame; w_{j→i} denotes the feature weight of a candidate detection frame computed and output by the temporal scoring model; f_{j→i} is obtained by mapping the motion information of the j-th frame to the i-th frame through the optical flow network and then applying a warp operation to the resulting optical-flow map and the image of the j-th frame.
Mapping the motion information of the j-th frame to the i-th frame through the optical flow network is defined as
f_{j→i} = W(f_j, M_{i→j}) = W(f_j, F(I_i, I_j))
where F(I_i, I_j) is the optical-flow computation performed on I_i and I_j by the optical flow network, whose result maps the motion information of the j-th frame to the i-th frame; f_j is the feature map of the j-th frame; and W(·,·) fuses the result of the optical-flow computation with frame I_j and performs a warp operation on the fused information, applying the linear deformation equation at each position of every channel's feature map.
The temporal scoring model of the present invention is described below with reference to Fig. 2, which shows its schematic diagram. As shown in Fig. 2,
the temporal scoring model is a deformable convolutional network model. By scoring the amount of object information contained in each candidate detection frame and its relevance to the current frame, the trained temporal scoring model assigns large weights to effective candidate detection frames and small weights to weakly effective or invalid ones. The input of the temporal scoring model is the unscored warped feature maps of the respective time steps and the feature map of the current frame, and its output is the weight value of each candidate detection frame.
The temporal scoring model has a pooling layer that can perform a global average pooling operation and a global maximum pooling operation. Its input information, the unscored warped feature maps of the respective time steps and the feature map of the current frame, also called candidate detection frames, is scored for the amount of object information contained in each candidate detection frame through global average pooling and global maximum pooling, yielding intermediate matrices after the operations.
The global average pooling operation is
G_{S-GA}(q_T) = (1/(H·W)) Σ_{q_x=1..H} Σ_{q_y=1..W} q_T(q_x, q_y)
where G_{S-GA}(...) denotes the global average pooling process; q_T denotes the T candidate detection frames; q_x and q_y denote pixel positions in the feature map; H is the height of the feature map before the global average pooling operation; and W is the width of the feature map before the global average pooling operation.
The global maximum pooling operation is
G_{S-GM}(q_T) = Max(q_T(q_x, q_y))
where G_{S-GM}(...) denotes the global maximum pooling process.
The global average pooling operation outputs a T×1 vector, forming the global-average-pooling intermediate matrix, and the global maximum pooling operation likewise outputs a T×1 vector, forming the global-maximum-pooling intermediate matrix.
The global-average-pooling intermediate matrix and the global-maximum-pooling intermediate matrix are input to the shared network layer, which scores the relevance of each candidate frame to the current frame. The shared network layer, which implements a convolution operation with parameters obtained empirically or by training, produces one weight matrix for the global average pooling branch and one for the global maximum pooling branch. The two weight matrices are then added element-wise to obtain a weight feature vector, which is used as the input of the activation function Relu:
Relu(x) = x, if x ≥ 0; Relu(x) = αx, if x < 0
where x is the input weight feature vector and α is a coefficient; α may take the value 0. This yields the temporal weights of the candidate detection frames.
In this embodiment, to better extract the image features of the candidate detection frames, the convolution filters in the shared network layer use deformable convolution. The conventional convolution operation is
y(p_0) = Σ_{p_n ∈ R} W(p_n) · X(p_0 + p_n)
where W(p_n) denotes the convolution kernel parameters, X denotes the image to be convolved, and R denotes the regular sampling grid of the kernel.
A learnable offset parameter Δp_n is added to the region of action of the conventional convolution operation; this parameter can be obtained by learning through fully-connected-layer convolution.
The temporal scoring model is a convolutional neural network model trained according to the loss function
l(y, v) = log(1 + exp(-y·v))
where v denotes the true value of each point of the candidate response map of a training image, and y ∈ {+1, -1} denotes the label of the standard tracking box. The model learns by minimizing this loss function; when the loss function stabilizes, training of the temporal scoring model is complete and its coefficients are obtained, and the trained temporal scoring model is used to compute the weight values of the candidate detection frames.
The deformable convolution is described below with reference to FIG. 3. As shown in FIG. 3A, a conventional 3×3 convolution takes the 9 pixels inside a square region and computes the linear combination y = Σ_i w_i x_i, where w_i are the coefficients of the convolution filter and x_i are the pixel values of the image. FIGS. 3B-3C show deformable convolutions: the 9 points participating in the computation may be arbitrary pixels of the current image, so such filters are more diverse and can extract richer features.
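The sampling behaviour of FIGS. 3A-3C can be sketched at a single output location. This is a simplified illustration: offsets are rounded to integers here, whereas the actual operator samples fractional positions with bilinear interpolation, and the function name and toy inputs are not from the patent.

```python
import numpy as np

def deformable_conv_point(X, W, p0, offsets):
    """Deformable 3x3 convolution at one output location p0:
    y(p0) = sum_n W(p_n) * X(p0 + p_n + dp_n).
    With all offsets zero this is the conventional convolution of FIG. 3A."""
    R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # 3x3 grid
    y = 0.0
    for pn, dpn, w in zip(R, offsets, W.ravel()):
        qy = p0[0] + pn[0] + int(round(dpn[0]))
        qx = p0[1] + pn[1] + int(round(dpn[1]))
        y += w * X[qy, qx]
    return y

X = np.arange(49, dtype=float).reshape(7, 7)   # toy image, X[y, x] = 7y + x
W = np.ones((3, 3)) / 9.0                      # averaging filter
zero = [(0.0, 0.0)] * 9    # no offsets: ordinary 3x3 convolution
shift = [(0.0, 1.0)] * 9   # every sample point shifted one pixel right
print(deformable_conv_point(X, W, (3, 3), zero))   # -> 24.0
print(deformable_conv_point(X, W, (3, 3), shift))  # -> 25.0
```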
The target tracking method of the present invention fusing optical flow information and the Siamese framework is described below with reference to FIG. 4, which shows a flowchart of the method. As shown in FIG. 4, the method includes:
S101: Obtain the current frame, which is the N-th frame, where N > 3, and obtain the three frames preceding it, namely the (N−3)-th, (N−2)-th and (N−1)-th frames. Use the TVNet optical flow network to compute the optical flow between each of the (N−3)-th, (N−2)-th and (N−1)-th frames and the current N-th frame, obtaining Flow1, Flow2 and Flow3; perform a crop operation on Flow1, Flow2 and Flow3 to obtain 22×22 optical flow vector maps P1, P2 and P3. Input the current frame into the feature network to obtain the 22×22 current-frame feature map F_N; combine F_N with P1, P2 and P3 respectively, and perform a warp operation on each combined result to obtain the warped feature maps F_1, F_2 and F_3;
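Step S101 can be sketched as follows. This is a minimal numpy stand-in: random arrays replace the TVNet flow fields and the feature-network output, and nearest-neighbour sampling stands in for the warp actually used.

```python
import numpy as np

def warp(feat, flow):
    """Warp an HxW feature map by a per-pixel flow field (dy, dx).
    Nearest-neighbour sampling approximates the bilinear warp used in
    practice; out-of-range samples are clamped to the border."""
    H, W = feat.shape
    out = np.empty_like(feat)
    for y in range(H):
        for x in range(W):
            sy = min(max(int(round(y + flow[y, x, 0])), 0), H - 1)
            sx = min(max(int(round(x + flow[y, x, 1])), 0), W - 1)
            out[y, x] = feat[sy, sx]
    return out

# 22x22 current-frame feature map F_N and three cropped flow fields
# P1..P3, as in step S101 (random stand-ins for the real networks).
rng = np.random.default_rng(0)
F_N = rng.standard_normal((22, 22))
F_1, F_2, F_3 = (warp(F_N, rng.uniform(-2, 2, (22, 22, 2))) for _ in range(3))
print(F_1.shape, F_2.shape, F_3.shape)
```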
S102: Input the warped feature maps F_1, F_2, F_3 together with the current-frame feature map F_N into the temporal scoring model as candidate detection frames, obtain the feature weights of the candidate detection frames, and multiply the feature weights with the candidate detection frames fused with optical flow features according to formula (1) to obtain the final detection frame:

f̄_i = Σ_j w_{j→i} · f_{j→i}    (1)
where i is the index of the current frame; I_i denotes the current frame (the i-th frame); I_j denotes a frame preceding the current frame I_i, e.g. the j-th frame, with j ∈ {i−T, …, i−2, i−1} and T = 3, i.e. the three frames preceding the current frame; f̄_i is the final detection frame obtained by fusing the optical flow information of the other frames into the current frame; w_{j→i} is the feature weight of the candidate detection frame computed and output by the temporal scoring model; and f_{j→i} is obtained by mapping the motion information of the j-th frame to the i-th frame through the optical flow network and then performing a warp operation on the resulting optical flow map and the image of the j-th frame;
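The weighted fusion of formula (1) can be sketched directly. Note the exact summation range is an assumption here (the candidates are simply zipped with their weights); the toy maps and weights are for illustration only.

```python
import numpy as np

def fuse(candidates, weights):
    """Formula (1): the final detection frame is the sum of the candidate
    detection frames f_{j->i}, each scaled by its temporal-scoring-model
    weight w_{j->i}.  candidates: list of HxW maps; weights: scalars."""
    return sum(w * f for w, f in zip(weights, candidates))

cands = [np.full((2, 2), v) for v in (1.0, 2.0, 4.0)]  # toy candidate frames
weights = [0.5, 0.25, 0.25]                             # toy temporal weights
final = fuse(cands, weights)
print(final)   # 0.5*1 + 0.25*2 + 0.25*4 = 2.0 everywhere
```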
Mapping the motion information of the j-th frame to the i-th frame through the optical flow network is defined as

f_{j→i} = W(f_j, M_{i→j}) = W(f_j, F(I_i, I_j))

where F(I_i, I_j) is the optical flow computed between I_i and I_j by the optical flow network, whose result maps the motion information of the j-th frame to the i-th frame; f_j is the feature map of the i-th frame; and W(·, ·) fuses the optical flow result with frame I_j and performs a warp operation on the fused information, applying the linear deformation equation of the feature-map positions of each channel;
The input of the temporal scoring model is the unscored warped feature maps of the respective time steps together with the feature map of the current frame, and its output is the weight values of the candidate detection frames.

The temporal scoring model has a pooling layer that can perform a global average pooling operation and a global maximum pooling operation; through these two operations, the amount of object information contained in each candidate detection frame is scored, yielding the pooled intermediate matrices.
The global average pooling operation is:

G_{S-GA}(q_T) = (1 / (H × W)) · Σ_{q_x=1}^{H} Σ_{q_y=1}^{W} q_T(q_x, q_y)

where G_{S-GA}(·) denotes the global average pooling process, q_T denotes the T candidate detection frames, q_x and q_y denote pixel positions in the feature map, H is the height of the feature map input to the global average pooling operation, and W is its width.
The global maximum pooling operation is:

G_{S-GM}(q_T) = Max(q_T(q_x, q_y))

where G_{S-GM}(·) denotes the global maximum pooling process.
The global average pooling operation outputs a T×1 vector, forming the global average pooling intermediate matrix; the global maximum pooling operation likewise outputs a T×1 vector, forming the global maximum pooling intermediate matrix.
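The two pooling branches above can be sketched in numpy: each branch reduces every candidate frame to a single scalar, so both outputs are T×1 vectors. The random frames are stand-ins for the real candidate detection frames.

```python
import numpy as np

# T candidate detection frames of size HxW (22x22, as in step S101).
T, H, W = 4, 22, 22
q = np.random.default_rng(1).standard_normal((T, H, W))

g_avg = q.mean(axis=(1, 2)).reshape(T, 1)   # G_S-GA: global average pooling
g_max = q.max(axis=(1, 2)).reshape(T, 1)    # G_S-GM: global maximum pooling
print(g_avg.shape, g_max.shape)             # both (T, 1)
```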
The global average pooling intermediate matrix and the global maximum pooling intermediate matrix are input into the shared network layer, which scores the relevance of each candidate frame to the current frame. The shared network layer implements a convolution operation whose parameters are obtained from empirical values or by training, and produces one weight matrix for the global average pooling branch and one for the global maximum pooling branch. The two weight matrices are added element-wise to obtain a weight feature vector, which is used as the input of the activation function Relu:

Relu(x) = x, if x ≥ 0;  Relu(x) = αx, if x < 0

thereby obtaining the temporal weights of the candidate detection frames;
The temporal scoring model is trained as a convolutional neural network with the loss function:

l(y, v) = log(1 + exp(−yv))

where v is the value at each point of the candidate response map of an image in the training set, and y ∈ {+1, −1} is the label of the ground-truth tracking box. The model learns by minimizing this loss function; when the loss stabilizes, training is complete and the coefficients of the temporal scoring model are obtained. The trained temporal scoring model is then used to compute the weight values of the candidate detection frames.
To better extract the image features of the candidate detection frames, the convolutional filters in the shared network layer use a deformable convolution, which adds a learnable offset Δp_n to the sampling region of the conventional convolution operation.
Please refer to FIG. 5, which is a block diagram of the target tracking apparatus fusing optical flow information and the Siamese framework proposed by the present invention. The apparatus is described below with reference to FIG. 5; as shown in the figure, the apparatus includes:
A feature acquisition module, configured to obtain the current frame, which is the N-th frame, where N > 3, and the three frames preceding it, namely the (N−3)-th, (N−2)-th and (N−1)-th frames; use the TVNet optical flow network to compute the optical flow between each of the (N−3)-th, (N−2)-th and (N−1)-th frames and the current N-th frame, obtaining Flow1, Flow2 and Flow3; perform a crop operation on Flow1, Flow2 and Flow3 to obtain 22×22 optical flow vector maps P1, P2 and P3; input the current frame into the feature network to obtain the 22×22 current-frame feature map F_N; and combine F_N with P1, P2 and P3 respectively, then perform a warp operation on each combined result to obtain the warped feature maps F_1, F_2 and F_3;
A weight calculation module, configured to input the warped feature maps F_1, F_2, F_3 together with the current-frame feature map F_N into the temporal scoring model as detection frames, obtain the feature weights of the candidate detection frames, and multiply the feature weights with the candidate detection frames fused with optical flow features according to formula (1) to obtain the final detection frame:

f̄_i = Σ_j w_{j→i} · f_{j→i}    (1)
where i is the index of the current frame; I_i denotes the current frame (the i-th frame); I_j denotes a frame preceding the current frame I_i, e.g. the j-th frame, with j ∈ {i−T, …, i−2, i−1} and T = 3, i.e. the three frames preceding the current frame; f̄_i is the final detection frame obtained by fusing the optical flow information of the other frames into the current frame; w_{j→i} is the feature weight of the candidate detection frame computed and output by the temporal scoring model; and f_{j→i} is obtained by mapping the motion information of the j-th frame to the i-th frame through the optical flow network and then performing a warp operation on the resulting optical flow map and the image of the j-th frame;
Mapping the motion information of the j-th frame to the i-th frame through the optical flow network is defined as

f_{j→i} = W(f_j, M_{i→j}) = W(f_j, F(I_i, I_j))

where F(I_i, I_j) is the optical flow computed between I_i and I_j by the optical flow network, whose result maps the motion information of the j-th frame to the i-th frame; f_j is the feature map of the i-th frame; and W(·, ·) fuses the optical flow result with frame I_j and performs a warp operation on the fused information, applying the linear deformation equation of the feature-map positions of each channel;
wherein the input of the temporal scoring model is the unscored warped feature maps F_1, F_2, F_3 of the respective time steps together with the current-frame feature map F_N, and its output is the weight values of the candidate detection frames;
The temporal scoring model has a pooling layer that can perform a global average pooling operation and a global maximum pooling operation; through these two operations, the amount of object information contained in each candidate detection frame is scored, yielding the pooled intermediate matrices.
The global average pooling operation is:

G_{S-GA}(q_T) = (1 / (H × W)) · Σ_{q_x=1}^{H} Σ_{q_y=1}^{W} q_T(q_x, q_y)

where G_{S-GA}(·) denotes the global average pooling process, q_T denotes the T candidate detection frames, q_x and q_y denote pixel positions in the feature map, H is the height of the input feature map, and W is its width;
The global maximum pooling operation is:

G_{S-GM}(q_T) = Max(q_T(q_x, q_y))

where G_{S-GM}(·) denotes the global maximum pooling process.
The global average pooling operation outputs a T×1 vector, forming the global average pooling intermediate matrix; the global maximum pooling operation likewise outputs a T×1 vector, forming the global maximum pooling intermediate matrix.
The global average pooling intermediate matrix and the global maximum pooling intermediate matrix are input into the shared network layer, which scores the relevance of each candidate frame to the current frame. The shared network layer implements a convolution operation whose parameters are obtained from empirical values or by training, and produces one weight matrix for the global average pooling branch and one for the global maximum pooling branch. The two weight matrices are added element-wise to obtain a weight feature vector, which is used as the input of the activation function Relu:

Relu(x) = x, if x ≥ 0;  Relu(x) = αx, if x < 0

where x is the input weight feature vector and α is a coefficient.
The temporal scoring model is trained as a convolutional neural network according to a loss function.
Further, the temporal scoring model is trained as a convolutional neural network with the loss function:

l(y, v) = log(1 + exp(−yv))

where v is the value at each point of the candidate response map of an image in the training set, and y ∈ {+1, −1} is the label of the ground-truth tracking box. The model learns by minimizing this loss function; when the loss stabilizes, training is complete and the coefficients of the temporal scoring model are obtained. The trained temporal scoring model is used to compute the weight values of the candidate detection frames, thereby obtaining the temporal weights of the candidate detection frames.
Further, to better extract the image features of the candidate detection frames, the convolutional filters in the shared network layer use a deformable convolution, which adds a learnable offset Δp_n to the sampling region of the conventional convolution operation.
An embodiment of the present invention further provides a target tracking system fusing optical flow information and the Siamese framework, comprising:

a processor, configured to execute a plurality of instructions; and

a memory, configured to store a plurality of instructions;

wherein the plurality of instructions are stored by the memory, and loaded and executed by the processor to perform the target tracking method fusing optical flow information and the Siamese framework as described above.

An embodiment of the present invention further provides a computer-readable storage medium storing a plurality of instructions, the plurality of instructions being loaded and executed by a processor to perform the target tracking method fusing optical flow information and the Siamese framework as described above.
It should be noted that the embodiments of the present invention and the features therein may be combined with one another where no conflict arises.

In the several embodiments provided by the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.

The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a physical server, or a network cloud server, etc., on which a Windows or Windows Server operating system is installed) to perform some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

The above are merely preferred embodiments of the present invention and are not intended to limit the present invention in any form; any simple modifications, equivalent changes and refinements made to the above embodiments in accordance with the technical essence of the present invention still fall within the scope of the technical solutions of the present invention.