Target tracking method and device fusing optical flow information and a Siamese framework
Technical Field
The present invention relates to the field of image recognition, and in particular to a target tracking method and device that fuse optical flow information with a Siamese framework.
Background Art
With the rapid development of computer vision, single-target tracking has attracted increasing public attention. Tracking algorithms have evolved from generative-model algorithms based on Kalman filters, particle filters, and feature-point matching to today's discriminative-model algorithms based on the correlation-filter framework and the Siamese (twin) framework, and their accuracy and computation speed continue to improve.
Generative-model algorithms based on feature-point matching have a simple model structure and require no training, but their accuracy is low and the feature points disappear under occlusion. Fully convolutional network algorithms based on the Siamese framework are fast, but they consider only the appearance features of the image and cannot track objects with complex backgrounds or violent motion.
Summary of the Invention
To solve the above technical problems, the present invention proposes a target tracking method and device that fuse optical flow information with a Siamese framework, addressing the prior-art technical problems that generative-model algorithms based on feature-point matching have low accuracy and that fully convolutional network algorithms based on the Siamese framework cannot track objects with complex backgrounds or violent motion.
According to a first aspect of the present invention, a target tracking method fusing optical flow information and a Siamese framework is provided, including:
S101: Obtain the current frame, which is the N-th frame (N > 3), and the three frames preceding it, namely the (N-3)-th, (N-2)-th, and (N-1)-th frames. Compute the optical flow between each of the (N-3)-th, (N-2)-th, and (N-1)-th frames and the current N-th frame using the TVNet optical flow network, obtaining Flow1, Flow2, and Flow3. Perform a crop operation on Flow1, Flow2, and Flow3 to obtain 22×22 optical-flow vector maps P1, P2, and P3. Input the current frame into the feature network to obtain a 22×22 current-frame feature map F_N. Combine F_N with each of P1, P2, and P3, and apply a warp operation to each combined result to obtain the warped feature maps F_1, F_2, and F_3;
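As an illustrative sketch (not part of the claimed method), the crop and warp operations of S101 can be written as follows, assuming the flow fields and the feature map are NumPy arrays and that the warp uses bilinear sampling; all function names are hypothetical:

```python
import numpy as np

def crop_center(flow, size=22):
    """Crop a flow field of shape (H, W, 2) to a centered size x size window."""
    h, w, _ = flow.shape
    top, left = (h - size) // 2, (w - size) // 2
    return flow[top:top + size, left:left + size, :]

def warp_feature(feat, flow):
    """Bilinearly warp a feature map (H, W, C) with a flow field (H, W, 2).

    Each output location (x, y) samples feat at (x + u, y + v), where
    (u, v) = flow[y, x]; sample positions are clamped to the map border.
    """
    h, w, _ = feat.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = (sx - x0)[..., None], (sy - y0)[..., None]
    # Blend the four neighboring feature vectors per output position.
    return ((1 - wy) * ((1 - wx) * feat[y0, x0] + wx * feat[y0, x1])
            + wy * ((1 - wx) * feat[y1, x0] + wx * feat[y1, x1]))
```

A 22×22 crop of each flow map followed by one `warp_feature` call per preceding frame produces the warped feature maps F_1, F_2, F_3.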
S102: Input the warped feature maps F_1, F_2, F_3 and the current-frame feature map F_N into the temporal scoring model as candidate detection frames to obtain the feature weight of each candidate detection frame, and multiply the feature weights with the optical-flow-fused candidate detection frames according to formula (1) to obtain the final detection frame;
formula (1) is
f̄_i = Σ_j w_{j→i} · f_{j→i}    (1)
with the sum taken over the candidate detection frames, where i denotes the index of the current frame, I_i denotes the current (i-th) frame, and I_j denotes a frame preceding the current frame I_i, such as the j-th frame, with j ∈ {i-T, ..., i-2, i-1} and T = 3, i.e., the three frames preceding the current frame; f̄_i is the final detection frame obtained by fusing the optical-flow information of the other frames into the current frame; w_{j→i} denotes the feature weight of a candidate detection frame computed and output by the temporal scoring model; f_{j→i} is obtained by mapping the motion information of the j-th frame to the i-th frame through the optical flow network and then applying a warp operation to the resulting optical-flow map and the image of the j-th frame;
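The weighted fusion of formula (1) can be sketched as follows, under the assumption that each candidate detection frame receives a single scalar weight from the scoring model; the names are illustrative:

```python
import numpy as np

def aggregate(warped_feats, weights):
    """Weighted fusion of candidate detection frames, in the spirit of formula (1).

    warped_feats: list of candidate feature maps f_{j->i}, each of shape (H, W, C)
    weights:      list of scalar weights w_{j->i} from the temporal scoring model
    Returns the fused final detection frame, a (H, W, C) array.
    """
    assert len(warped_feats) == len(weights)
    return sum(w * f for w, f in zip(weights, warped_feats))
```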
Mapping the motion information of the j-th frame to the i-th frame through the optical flow network is defined as
f_{j→i} = W(f_j, M_{i→j}) = W(f_j, F(I_i, I_j))
where F(I_i, I_j) is the optical-flow computation performed on I_i and I_j by the optical flow network, whose result maps the motion information of the j-th frame to the i-th frame; f_j is the feature map of the j-th frame; and W(·,·) fuses the result of the optical-flow computation with frame I_j and performs a warp operation on the fused information, applying the linear deformation equation at each position of every channel's feature map;
wherein the input of the temporal scoring model is the unscored warped feature maps F_1, F_2, F_3 of the respective time steps together with the current-frame feature map F_N, and its output is the weight value of each candidate detection frame;
the temporal scoring model has a pooling layer that can perform a global average pooling operation and a global maximum pooling operation; through these two operations, the amount of object information contained in each candidate detection frame is scored, yielding intermediate matrices after the operations;
the global average pooling operation is
G_{S-GA}(q_T) = (1/(H·W)) Σ_{q_x=1..H} Σ_{q_y=1..W} q_T(q_x, q_y)
where G_{S-GA}(...) denotes the global average pooling process; q_T denotes the T candidate detection frames; q_x and q_y denote pixel positions in the feature map; H is the height of the feature map before the global average pooling operation; and W is the width of the feature map before the global average pooling operation;
the global maximum pooling operation is
G_{S-GM}(q_T) = Max(q_T(q_x, q_y))
where G_{S-GM}(...) denotes the global maximum pooling process;
the global average pooling operation outputs a T×1 vector, forming the global-average-pooling intermediate matrix, and the global maximum pooling operation likewise outputs a T×1 vector, forming the global-maximum-pooling intermediate matrix;
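The two pooling operations can be sketched as follows, assuming the T candidate detection frames are stacked into a (T, H, W) array; each function returns the T×1 intermediate matrix described above (names illustrative):

```python
import numpy as np

def global_avg_pool(q):
    """Global average pooling over each of T candidate frames: (T, H, W) -> (T, 1)."""
    T, H, W = q.shape
    return q.reshape(T, H * W).mean(axis=1, keepdims=True)

def global_max_pool(q):
    """Global maximum pooling over each of T candidate frames: (T, H, W) -> (T, 1)."""
    T, H, W = q.shape
    return q.reshape(T, H * W).max(axis=1, keepdims=True)
```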
the global-average-pooling intermediate matrix and the global-maximum-pooling intermediate matrix are input to a shared network layer, which scores the relevance of each candidate frame to the current frame; the shared network layer, which implements a convolution operation with parameters obtained empirically or by training, produces one weight matrix for the global average pooling branch and one for the global maximum pooling branch; the two weight matrices are then added element-wise to obtain a weight feature vector, which is used as the input of the activation function Relu:
Relu(x) = x, if x ≥ 0; Relu(x) = αx, if x < 0
where x is the input weight feature vector and α is a coefficient;
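Under the stated coefficient α, this activation is a leaky ReLU; a minimal sketch (setting α = 0 recovers the standard ReLU):

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    """Relu(x) = x for x >= 0, alpha * x otherwise; alpha is the coefficient above."""
    return np.where(x >= 0, x, alpha * x)
```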
the temporal scoring model is a convolutional neural network model trained according to a loss function.
Further, the temporal scoring model is a convolutional neural network model trained according to a loss function, the loss function being:
l(y, v) = log(1 + exp(-y·v))
where v denotes the true value of each point of the candidate response map of a training image, and y ∈ {+1, -1} denotes the label of the standard tracking box; the model learns by minimizing this loss function, and when the loss function stabilizes, training of the temporal scoring model is complete and its coefficients are obtained; the trained temporal scoring model then computes the weight values of the candidate detection frames, yielding the temporal weights of the candidate detection frames.
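The loss l(y, v) = log(1 + exp(-y·v)) can be evaluated as follows; np.logaddexp is used so the computation stays numerically stable for large margins:

```python
import numpy as np

def logistic_loss(y, v):
    """l(y, v) = log(1 + exp(-y * v)), computed as logaddexp(0, -y*v) for stability."""
    return np.logaddexp(0.0, -y * v)
```

A correct, confident prediction (y and v of the same sign, large |v|) drives the loss toward zero, while a confident mistake grows roughly linearly in |v|.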
Further, to better extract the image features of the candidate detection frames, the convolution filters in the shared network layer use deformable convolution: a learnable offset parameter Δp_n is added to the region of action of the conventional convolution operation.
According to a second aspect of the present invention, a target tracking device fusing optical flow information and a Siamese framework is provided, including:
a feature acquisition module, configured to obtain the current frame, which is the N-th frame (N > 3), and the three frames preceding it, namely the (N-3)-th, (N-2)-th, and (N-1)-th frames; compute the optical flow between each of the (N-3)-th, (N-2)-th, and (N-1)-th frames and the current N-th frame using the TVNet optical flow network, obtaining Flow1, Flow2, and Flow3; perform a crop operation on Flow1, Flow2, and Flow3 to obtain 22×22 optical-flow vector maps P1, P2, and P3; input the current frame into the feature network to obtain a 22×22 current-frame feature map F_N; and combine F_N with each of P1, P2, and P3, applying a warp operation to each combined result to obtain the warped feature maps F_1, F_2, and F_3;
a weight calculation module, configured to input the warped feature maps F_1, F_2, F_3 and the current-frame feature map F_N into the temporal scoring model as candidate detection frames to obtain the feature weight of each candidate detection frame, and to multiply the feature weights with the optical-flow-fused candidate detection frames according to formula (1) to obtain the final detection frame;
formula (1) is
f̄_i = Σ_j w_{j→i} · f_{j→i}    (1)
with the sum taken over the candidate detection frames, where i denotes the index of the current frame, I_i denotes the current (i-th) frame, and I_j denotes a frame preceding the current frame I_i, such as the j-th frame, with j ∈ {i-T, ..., i-2, i-1} and T = 3, i.e., the three frames preceding the current frame; f̄_i is the final detection frame obtained by fusing the optical-flow information of the other frames into the current frame; w_{j→i} denotes the feature weight of a candidate detection frame computed and output by the temporal scoring model; f_{j→i} is obtained by mapping the motion information of the j-th frame to the i-th frame through the optical flow network and then applying a warp operation to the resulting optical-flow map and the image of the j-th frame;
Mapping the motion information of the j-th frame to the i-th frame through the optical flow network is defined as
f_{j→i} = W(f_j, M_{i→j}) = W(f_j, F(I_i, I_j))
where F(I_i, I_j) is the optical-flow computation performed on I_i and I_j by the optical flow network, whose result maps the motion information of the j-th frame to the i-th frame; f_j is the feature map of the j-th frame; and W(·,·) fuses the result of the optical-flow computation with frame I_j and performs a warp operation on the fused information, applying the linear deformation equation at each position of every channel's feature map;
wherein the input of the temporal scoring model is the unscored warped feature maps F_1, F_2, F_3 of the respective time steps together with the current-frame feature map F_N, and its output is the weight value of each candidate detection frame;
the temporal scoring model has a pooling layer that can perform a global average pooling operation and a global maximum pooling operation; through these two operations, the amount of object information contained in each candidate detection frame is scored, yielding intermediate matrices after the operations;
the global average pooling operation is
G_{S-GA}(q_T) = (1/(H·W)) Σ_{q_x=1..H} Σ_{q_y=1..W} q_T(q_x, q_y)
where G_{S-GA}(...) denotes the global average pooling process; q_T denotes the T candidate detection frames; q_x and q_y denote pixel positions in the feature map; H is the height of the feature map before the global average pooling operation; and W is the width of the feature map before the global average pooling operation;
the global maximum pooling operation is
G_{S-GM}(q_T) = Max(q_T(q_x, q_y))
where G_{S-GM}(...) denotes the global maximum pooling process;
the global average pooling operation outputs a T×1 vector, forming the global-average-pooling intermediate matrix, and the global maximum pooling operation likewise outputs a T×1 vector, forming the global-maximum-pooling intermediate matrix;
the global-average-pooling intermediate matrix and the global-maximum-pooling intermediate matrix are input to a shared network layer, which scores the relevance of each candidate frame to the current frame; the shared network layer, which implements a convolution operation with parameters obtained empirically or by training, produces one weight matrix for the global average pooling branch and one for the global maximum pooling branch; the two weight matrices are then added element-wise to obtain a weight feature vector, which is used as the input of the activation function Relu:
Relu(x) = x, if x ≥ 0; Relu(x) = αx, if x < 0
where x is the input weight feature vector and α is a coefficient;
the temporal scoring model is a convolutional neural network model trained according to a loss function.
Further, the temporal scoring model is a convolutional neural network model trained according to a loss function, the loss function being:
l(y, v) = log(1 + exp(-y·v))
where v denotes the true value of each point of the candidate response map of a training image, and y ∈ {+1, -1} denotes the label of the standard tracking box; the model learns by minimizing this loss function, and when the loss function stabilizes, training of the temporal scoring model is complete and its coefficients are obtained; the trained temporal scoring model then computes the weight values of the candidate detection frames, yielding the temporal weights of the candidate detection frames.
Further, to better extract the image features of the candidate detection frames, the convolution filters in the shared network layer use deformable convolution: a learnable offset parameter Δp_n is added to the region of action of the conventional convolution operation.
According to a third aspect of the present invention, a target tracking system fusing optical flow information and a Siamese framework is provided, including:
a processor, configured to execute a plurality of instructions; and
a memory, configured to store a plurality of instructions;
wherein the plurality of instructions are stored by the memory and loaded and executed by the processor to perform the target tracking method fusing optical flow information and a Siamese framework described above.
According to a fourth aspect of the present invention, a computer-readable storage medium is provided, the storage medium storing a plurality of instructions; the plurality of instructions are loaded by a processor to execute the target tracking method fusing optical flow information and a Siamese framework described above.
According to the above solution of the present invention, target tracking is performed based on feature maps that integrate optical flow information, combined with a Siamese framework; the method achieves high accuracy and fast computation and can track objects with complex backgrounds and violent motion.
The above description is only an overview of the technical solution of the present invention. To make the technical means of the present invention clearer and implementable according to the contents of the specification, preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Brief Description of the Drawings
The drawings, which constitute a part of the present invention, are provided for a further understanding of the present invention. In the drawings:
Fig. 1 is a structural diagram of a target tracking system fusing optical flow information and a Siamese framework according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the temporal scoring model according to an embodiment of the present invention;
Fig. 3A is a schematic diagram of a conventional 3×3 convolution calculation;
Figs. 3B-3C are schematic diagrams of deformable convolution calculation;
Fig. 4 is a flowchart of the target tracking method fusing optical flow information and a Siamese framework proposed by the present invention;
Fig. 5 is a block diagram of the target tracking device fusing optical flow information and a Siamese framework proposed by the present invention.
Detailed Description of the Embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to specific embodiments and the corresponding drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative work fall within the protection scope of the present invention.
The structure of the target tracking system fusing optical flow information and a Siamese framework of the present invention is first described with reference to Fig. 1, which shows a structural diagram of such a system according to an embodiment of the present invention.
Obtain the current frame, which is the N-th frame (N > 3), and the three frames preceding it, namely the (N-3)-th, (N-2)-th, and (N-1)-th frames. Compute the optical flow between each of the (N-3)-th, (N-2)-th, and (N-1)-th frames and the current frame (the N-th frame) using the TVNet optical flow network (for the TVNet optical flow network, see VALMADRE J, BERTINETTO L, HENRIQUES J, et al. End-to-end representation learning for correlation filter based tracking[C]. Honolulu, Hawaii, USA. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 2805-2813), obtaining Flow1, Flow2, and Flow3. Perform a crop operation on Flow1, Flow2, and Flow3 to obtain 22×22 optical-flow vector maps P1, P2, and P3. Construct a feature network based on AlexNet; the feature network is built by removing the fully connected layers from AlexNet. Input the current frame into the feature network to obtain a 22×22 current-frame feature map F_N. Combine F_N with each of P1, P2, and P3, and apply a warp operation to each combined result to obtain the warped feature maps F_1, F_2, and F_3. Finally, input the warped feature maps F_1, F_2, F_3 and the current-frame feature map F_N into the temporal scoring model as candidate detection frames to obtain the feature weight of each candidate detection frame, and multiply the feature weights with the optical-flow-fused candidate detection frames according to formula (1) to obtain the final detection frame.
Here, formula (1) is
f̄_i = Σ_j w_{j→i} · f_{j→i}    (1)
with the sum taken over the candidate detection frames, where i denotes the index of the current frame, I_i denotes the current (i-th) frame, and I_j denotes a frame preceding the current frame I_i, such as the j-th frame, with j ∈ {i-T, ..., i-2, i-1}; in this embodiment T = 3, i.e., the three frames preceding the current frame; f̄_i is the final detection frame obtained by fusing the optical-flow information of the other frames into the current frame; w_{j→i} denotes the feature weight of a candidate detection frame computed and output by the temporal scoring model; f_{j→i} is obtained by mapping the motion information of the j-th frame to the i-th frame through the optical flow network and then applying a warp operation to the resulting optical-flow map and the image of the j-th frame.
Mapping the motion information of the j-th frame to the i-th frame through the optical flow network is defined as
f_{j→i} = W(f_j, M_{i→j}) = W(f_j, F(I_i, I_j))
where F(I_i, I_j) is the optical-flow computation performed on I_i and I_j by the optical flow network, whose result maps the motion information of the j-th frame to the i-th frame; f_j is the feature map of the j-th frame; and W(·,·) fuses the result of the optical-flow computation with frame I_j and performs a warp operation on the fused information, applying the linear deformation equation at each position of every channel's feature map.
The temporal scoring model of the present invention is described below with reference to Fig. 2, which shows its schematic diagram. As shown in Fig. 2,
the temporal scoring model is a deformable convolutional network model. By scoring the amount of object information contained in each candidate detection frame and its relevance to the current frame, the trained temporal scoring model assigns large weights to effective candidate detection frames and small weights to weakly effective or invalid ones. The input of the temporal scoring model is the unscored warped feature maps of the respective time steps and the feature map of the current frame, and its output is the weight value of each candidate detection frame.
The temporal scoring model has a pooling layer that can perform a global average pooling operation and a global maximum pooling operation. Its input information, the unscored warped feature maps of the respective time steps and the feature map of the current frame, also called candidate detection frames, is scored for the amount of object information contained in each candidate detection frame through global average pooling and global maximum pooling, yielding intermediate matrices after the operations.
The global average pooling operation is
G_{S-GA}(q_T) = (1/(H·W)) Σ_{q_x=1..H} Σ_{q_y=1..W} q_T(q_x, q_y)
where G_{S-GA}(...) denotes the global average pooling process; q_T denotes the T candidate detection frames; q_x and q_y denote pixel positions in the feature map; H is the height of the feature map before the global average pooling operation; and W is the width of the feature map before the global average pooling operation.
The global maximum pooling operation is
G_{S-GM}(q_T) = Max(q_T(q_x, q_y))
where G_{S-GM}(...) denotes the global maximum pooling process.
The global average pooling operation outputs a T×1 vector, forming the global-average-pooling intermediate matrix, and the global maximum pooling operation likewise outputs a T×1 vector, forming the global-maximum-pooling intermediate matrix.
The global-average-pooling intermediate matrix and the global-maximum-pooling intermediate matrix are input to the shared network layer, which scores the relevance of each candidate frame to the current frame. The shared network layer, which implements a convolution operation with parameters obtained empirically or by training, produces one weight matrix for the global average pooling branch and one for the global maximum pooling branch. The two weight matrices are then added element-wise to obtain a weight feature vector, which is used as the input of the activation function Relu:
Relu(x) = x, if x ≥ 0; Relu(x) = αx, if x < 0
where x is the input weight feature vector and α is a coefficient; α may take the value 0. This yields the temporal weights of the candidate detection frames.
In this embodiment, to better extract the image features of the candidate detection frames, the convolution filters in the shared network layer use deformable convolution. The conventional convolution operation is
y(p_0) = Σ_{p_n ∈ R} W(p_n) · X(p_0 + p_n)
where W(p_n) denotes the convolution kernel parameters, X denotes the image to be convolved, and R denotes the regular sampling grid of the kernel.
A learnable offset parameter Δp_n is added to the region of action of the conventional convolution operation; this parameter can be obtained by learning through fully-connected-layer convolution.
The temporal scoring model is a convolutional neural network model trained according to the loss function
l(y, v) = log(1 + exp(-y·v))
where v denotes the true value of each point of the candidate response map of a training image, and y ∈ {+1, -1} denotes the label of the standard tracking box. The model learns by minimizing this loss function; when the loss function stabilizes, training of the temporal scoring model is complete and its coefficients are obtained, and the trained temporal scoring model is used to compute the weight values of the candidate detection frames.
The deformable convolution is described below with reference to FIG. 3. As shown in FIG. 3A, a conventional 3×3 convolution takes the 9 pixels inside a square region and computes the linear combination y = Σ_i w_i x_i, where w_i are the coefficients of the convolution filter and x_i are the pixel values of the image. FIGS. 3B-3C show deformable convolutions: the 9 points participating in the computation may be arbitrary pixels of the current image, so such filters are more diverse and can extract richer features.
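The sampling behaviour of FIGS. 3A-3C can be sketched at a single output location. This is a simplified illustration: offsets are rounded to integers here, whereas the actual operator samples fractional positions with bilinear interpolation, and the function name and toy inputs are not from the patent.

```python
import numpy as np

def deformable_conv_point(X, W, p0, offsets):
    """Deformable 3x3 convolution at one output location p0:
    y(p0) = sum_n W(p_n) * X(p0 + p_n + dp_n).
    With all offsets zero this is the conventional convolution of FIG. 3A."""
    R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # 3x3 grid
    y = 0.0
    for pn, dpn, w in zip(R, offsets, W.ravel()):
        qy = p0[0] + pn[0] + int(round(dpn[0]))
        qx = p0[1] + pn[1] + int(round(dpn[1]))
        y += w * X[qy, qx]
    return y

X = np.arange(49, dtype=float).reshape(7, 7)   # toy image, X[y, x] = 7y + x
W = np.ones((3, 3)) / 9.0                      # averaging filter
zero = [(0.0, 0.0)] * 9    # no offsets: ordinary 3x3 convolution
shift = [(0.0, 1.0)] * 9   # every sample point shifted one pixel right
print(deformable_conv_point(X, W, (3, 3), zero))   # -> 24.0
print(deformable_conv_point(X, W, (3, 3), shift))  # -> 25.0
```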
The target tracking method of the present invention fusing optical flow information and the Siamese framework is described below with reference to FIG. 4, which shows a flowchart of the method. As shown in FIG. 4, the method includes:
S101: Obtain the current frame, which is the N-th frame, where N > 3, and obtain the three frames preceding it, namely the (N−3)-th, (N−2)-th and (N−1)-th frames. Use the TVNet optical flow network to compute the optical flow between each of the (N−3)-th, (N−2)-th and (N−1)-th frames and the current N-th frame, obtaining Flow1, Flow2 and Flow3; perform a crop operation on Flow1, Flow2 and Flow3 to obtain 22×22 optical flow vector maps P1, P2 and P3. Input the current frame into the feature network to obtain the 22×22 current-frame feature map F_N; combine F_N with P1, P2 and P3 respectively, and perform a warp operation on each combined result to obtain the warped feature maps F_1, F_2 and F_3;
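Step S101 can be sketched as follows. This is a minimal numpy stand-in: random arrays replace the TVNet flow fields and the feature-network output, and nearest-neighbour sampling stands in for the warp actually used.

```python
import numpy as np

def warp(feat, flow):
    """Warp an HxW feature map by a per-pixel flow field (dy, dx).
    Nearest-neighbour sampling approximates the bilinear warp used in
    practice; out-of-range samples are clamped to the border."""
    H, W = feat.shape
    out = np.empty_like(feat)
    for y in range(H):
        for x in range(W):
            sy = min(max(int(round(y + flow[y, x, 0])), 0), H - 1)
            sx = min(max(int(round(x + flow[y, x, 1])), 0), W - 1)
            out[y, x] = feat[sy, sx]
    return out

# 22x22 current-frame feature map F_N and three cropped flow fields
# P1..P3, as in step S101 (random stand-ins for the real networks).
rng = np.random.default_rng(0)
F_N = rng.standard_normal((22, 22))
F_1, F_2, F_3 = (warp(F_N, rng.uniform(-2, 2, (22, 22, 2))) for _ in range(3))
print(F_1.shape, F_2.shape, F_3.shape)
```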
S102: Input the warped feature maps F_1, F_2, F_3 together with the current-frame feature map F_N into the temporal scoring model as candidate detection frames, obtain the feature weights of the candidate detection frames, and multiply the feature weights with the candidate detection frames fused with optical flow features according to formula (1) to obtain the final detection frame:

f̄_i = Σ_j w_{j→i} · f_{j→i}    (1)
where i is the index of the current frame; I_i denotes the current frame (the i-th frame); I_j denotes a frame preceding the current frame I_i, e.g. the j-th frame, with j ∈ {i−T, …, i−2, i−1} and T = 3, i.e. the three frames preceding the current frame; f̄_i is the final detection frame obtained by fusing the optical flow information of the other frames into the current frame; w_{j→i} is the feature weight of the candidate detection frame computed and output by the temporal scoring model; and f_{j→i} is obtained by mapping the motion information of the j-th frame to the i-th frame through the optical flow network and then performing a warp operation on the resulting optical flow map and the image of the j-th frame;
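The weighted fusion of formula (1) can be sketched directly. Note the exact summation range is an assumption here (the candidates are simply zipped with their weights); the toy maps and weights are for illustration only.

```python
import numpy as np

def fuse(candidates, weights):
    """Formula (1): the final detection frame is the sum of the candidate
    detection frames f_{j->i}, each scaled by its temporal-scoring-model
    weight w_{j->i}.  candidates: list of HxW maps; weights: scalars."""
    return sum(w * f for w, f in zip(weights, candidates))

cands = [np.full((2, 2), v) for v in (1.0, 2.0, 4.0)]  # toy candidate frames
weights = [0.5, 0.25, 0.25]                             # toy temporal weights
final = fuse(cands, weights)
print(final)   # 0.5*1 + 0.25*2 + 0.25*4 = 2.0 everywhere
```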
Mapping the motion information of the j-th frame to the i-th frame through the optical flow network is defined as

f_{j→i} = W(f_j, M_{i→j}) = W(f_j, F(I_i, I_j))

where F(I_i, I_j) is the optical flow computed between I_i and I_j by the optical flow network, whose result maps the motion information of the j-th frame to the i-th frame; f_j is the feature map of the i-th frame; and W(·, ·) fuses the optical flow result with frame I_j and performs a warp operation on the fused information, applying the linear deformation equation of the feature-map positions of each channel;
The input of the temporal scoring model is the unscored warped feature maps of the respective time steps together with the feature map of the current frame, and its output is the weight values of the candidate detection frames.

The temporal scoring model has a pooling layer that can perform a global average pooling operation and a global maximum pooling operation; through these two operations, the amount of object information contained in each candidate detection frame is scored, yielding the pooled intermediate matrices.
The global average pooling operation is:

G_{S-GA}(q_T) = (1 / (H × W)) · Σ_{q_x=1}^{H} Σ_{q_y=1}^{W} q_T(q_x, q_y)

where G_{S-GA}(·) denotes the global average pooling process, q_T denotes the T candidate detection frames, q_x and q_y denote pixel positions in the feature map, H is the height of the feature map input to the global average pooling operation, and W is its width.
The global maximum pooling operation is:

G_{S-GM}(q_T) = Max(q_T(q_x, q_y))

where G_{S-GM}(·) denotes the global maximum pooling process.
The global average pooling operation outputs a T×1 vector, forming the global average pooling intermediate matrix; the global maximum pooling operation likewise outputs a T×1 vector, forming the global maximum pooling intermediate matrix.
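The two pooling branches above can be sketched in numpy: each branch reduces every candidate frame to a single scalar, so both outputs are T×1 vectors. The random frames are stand-ins for the real candidate detection frames.

```python
import numpy as np

# T candidate detection frames of size HxW (22x22, as in step S101).
T, H, W = 4, 22, 22
q = np.random.default_rng(1).standard_normal((T, H, W))

g_avg = q.mean(axis=(1, 2)).reshape(T, 1)   # G_S-GA: global average pooling
g_max = q.max(axis=(1, 2)).reshape(T, 1)    # G_S-GM: global maximum pooling
print(g_avg.shape, g_max.shape)             # both (T, 1)
```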
The global average pooling intermediate matrix and the global maximum pooling intermediate matrix are input into the shared network layer, which scores the relevance of each candidate frame to the current frame. The shared network layer implements a convolution operation whose parameters are obtained from empirical values or by training, and produces one weight matrix for the global average pooling branch and one for the global maximum pooling branch. The two weight matrices are added element-wise to obtain a weight feature vector, which is used as the input of the activation function Relu:

Relu(x) = x, if x ≥ 0;  Relu(x) = αx, if x < 0

thereby obtaining the temporal weights of the candidate detection frames;
The temporal scoring model is trained as a convolutional neural network with the loss function:

l(y, v) = log(1 + exp(−yv))

where v is the value at each point of the candidate response map of an image in the training set, and y ∈ {+1, −1} is the label of the ground-truth tracking box. The model learns by minimizing this loss function; when the loss stabilizes, training is complete and the coefficients of the temporal scoring model are obtained. The trained temporal scoring model is then used to compute the weight values of the candidate detection frames.
To better extract the image features of the candidate detection frames, the convolutional filters in the shared network layer use a deformable convolution, which adds a learnable offset Δp_n to the sampling region of the conventional convolution operation.
Please refer to FIG. 5, which is a block diagram of the target tracking apparatus fusing optical flow information and the Siamese framework proposed by the present invention. The apparatus is described below with reference to FIG. 5; as shown in the figure, the apparatus includes:
A feature acquisition module, configured to obtain the current frame, which is the N-th frame, where N > 3, and the three frames preceding it, namely the (N−3)-th, (N−2)-th and (N−1)-th frames; use the TVNet optical flow network to compute the optical flow between each of the (N−3)-th, (N−2)-th and (N−1)-th frames and the current N-th frame, obtaining Flow1, Flow2 and Flow3; perform a crop operation on Flow1, Flow2 and Flow3 to obtain 22×22 optical flow vector maps P1, P2 and P3; input the current frame into the feature network to obtain the 22×22 current-frame feature map F_N; and combine F_N with P1, P2 and P3 respectively, then perform a warp operation on each combined result to obtain the warped feature maps F_1, F_2 and F_3;
A weight calculation module, configured to input the warped feature maps F_1, F_2, F_3 together with the current-frame feature map F_N into the temporal scoring model as detection frames, obtain the feature weights of the candidate detection frames, and multiply the feature weights with the candidate detection frames fused with optical flow features according to formula (1) to obtain the final detection frame:

f̄_i = Σ_j w_{j→i} · f_{j→i}    (1)
where i is the index of the current frame; I_i denotes the current frame (the i-th frame); I_j denotes a frame preceding the current frame I_i, e.g. the j-th frame, with j ∈ {i−T, …, i−2, i−1} and T = 3, i.e. the three frames preceding the current frame; f̄_i is the final detection frame obtained by fusing the optical flow information of the other frames into the current frame; w_{j→i} is the feature weight of the candidate detection frame computed and output by the temporal scoring model; and f_{j→i} is obtained by mapping the motion information of the j-th frame to the i-th frame through the optical flow network and then performing a warp operation on the resulting optical flow map and the image of the j-th frame;
Mapping the motion information of the j-th frame to the i-th frame through the optical flow network is defined as

f_{j→i} = W(f_j, M_{i→j}) = W(f_j, F(I_i, I_j))

where F(I_i, I_j) is the optical flow computed between I_i and I_j by the optical flow network, whose result maps the motion information of the j-th frame to the i-th frame; f_j is the feature map of the i-th frame; and W(·, ·) fuses the optical flow result with frame I_j and performs a warp operation on the fused information, applying the linear deformation equation of the feature-map positions of each channel;
wherein the input of the temporal scoring model is the unscored warped feature maps F_1, F_2, F_3 of the respective time steps together with the current-frame feature map F_N, and its output is the weight values of the candidate detection frames;
The temporal scoring model has a pooling layer that can perform a global average pooling operation and a global maximum pooling operation; through these two operations, the amount of object information contained in each candidate detection frame is scored, yielding the pooled intermediate matrices.
The global average pooling operation is:

G_{S-GA}(q_T) = (1 / (H × W)) · Σ_{q_x=1}^{H} Σ_{q_y=1}^{W} q_T(q_x, q_y)

where G_{S-GA}(·) denotes the global average pooling process, q_T denotes the T candidate detection frames, q_x and q_y denote pixel positions in the feature map, H is the height of the input feature map, and W is its width;
The global maximum pooling operation is:

G_{S-GM}(q_T) = Max(q_T(q_x, q_y))

where G_{S-GM}(·) denotes the global maximum pooling process.
The global average pooling operation outputs a T×1 vector, forming the global average pooling intermediate matrix; the global maximum pooling operation likewise outputs a T×1 vector, forming the global maximum pooling intermediate matrix.
The global average pooling intermediate matrix and the global maximum pooling intermediate matrix are input into the shared network layer, which scores the relevance of each candidate frame to the current frame. The shared network layer implements a convolution operation whose parameters are obtained from empirical values or by training, and produces one weight matrix for the global average pooling branch and one for the global maximum pooling branch. The two weight matrices are added element-wise to obtain a weight feature vector, which is used as the input of the activation function Relu:

Relu(x) = x, if x ≥ 0;  Relu(x) = αx, if x < 0

where x is the input weight feature vector and α is a coefficient.
The temporal scoring model is trained as a convolutional neural network according to a loss function.
Further, the temporal scoring model is trained as a convolutional neural network with the loss function:

l(y, v) = log(1 + exp(−yv))

where v is the value at each point of the candidate response map of an image in the training set, and y ∈ {+1, −1} is the label of the ground-truth tracking box. The model learns by minimizing this loss function; when the loss stabilizes, training is complete and the coefficients of the temporal scoring model are obtained. The trained temporal scoring model is used to compute the weight values of the candidate detection frames, thereby obtaining the temporal weights of the candidate detection frames.
Further, to better extract the image features of the candidate detection frames, the convolutional filters in the shared network layer use a deformable convolution, which adds a learnable offset Δp_n to the sampling region of the conventional convolution operation.
An embodiment of the present invention further provides a target tracking system fusing optical flow information and the Siamese framework, comprising:

a processor, configured to execute a plurality of instructions; and

a memory, configured to store a plurality of instructions;

wherein the plurality of instructions are stored by the memory, and loaded and executed by the processor to perform the target tracking method fusing optical flow information and the Siamese framework as described above.

An embodiment of the present invention further provides a computer-readable storage medium storing a plurality of instructions, the plurality of instructions being loaded and executed by a processor to perform the target tracking method fusing optical flow information and the Siamese framework as described above.
It should be noted that the embodiments of the present invention and the features therein may be combined with one another where no conflict arises.

In the several embodiments provided by the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.

The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a physical server, or a network cloud server, etc., on which a Windows or Windows Server operating system is installed) to perform some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

The above are merely preferred embodiments of the present invention and are not intended to limit the present invention in any form; any simple modifications, equivalent changes and refinements made to the above embodiments in accordance with the technical essence of the present invention still fall within the scope of the technical solutions of the present invention.