CN114155273B - Video image single-target tracking method combining historical track information - Google Patents

Video image single-target tracking method combining historical track information

Info

Publication number
CN114155273B
CN114155273B (application CN202111221441.0A)
Authority
CN
China
Prior art keywords
image
target
feature map
layer
regression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111221441.0A
Other languages
Chinese (zh)
Other versions
CN114155273A (en)
Inventor
杨兆龙
庞惠民
夏永清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dali Technology Co ltd
Original Assignee
Zhejiang Dali Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dali Technology Co ltd filed Critical Zhejiang Dali Technology Co ltd
Priority to CN202111221441.0A priority Critical patent/CN114155273B/en
Publication of CN114155273A publication Critical patent/CN114155273A/en
Application granted granted Critical
Publication of CN114155273B publication Critical patent/CN114155273B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a video image single-target tracking method combining historical track information, which comprises the following steps: the template image and the current frame search image are each fed into a trained convolutional neural network feature extraction layer to obtain a template image feature map and a search image feature map; the template image feature map and the search image feature map are then fed in turn into a trained convolutional neural network classification layer and regression layer to obtain a classification feature map and a regression feature map for the template image and for the search image; cross-correlation is performed on the classification feature maps and on the regression feature maps of the template image and the search image to obtain a classification-layer response map and a regression-layer response map; a maximum pooling operation is applied to the classification-layer response map; and the predicted coordinate value closest to the target's predicted coordinate in the previous frame search image and to the target's historical track over the previous M frame search images is selected as the final predicted coordinate value of the target in the current frame search image.

Description

Video image single-target tracking method combining historical track information
Technical Field
The invention relates to a single-target tracking method combining historical track information, specifically a single-target tracking method based on a twin (Siamese) neural network and historical track information, and belongs to the fields of image processing and computer vision.
Background
Computer vision is the discipline that studies how to make a computer "see" like a human: cameras and computers take the place of human eyes, so that a machine can perform, on a target, the functions of extraction, identification and tracking that the human brain performs on what the eyes see.
Target tracking analyzes a sequence of video frames, matches the detected candidate target regions and locates the coordinates of the targets in the video sequence; in short, it localizes a target throughout the image sequence. Research on target tracking algorithms is a hotspot in computer vision and has important research and application value in scenarios such as virtual reality, human-computer interaction, intelligent surveillance, augmented reality and machine perception.
Target tracking in a single scene mainly studies the continuous tracking of a single target, i.e. tracking only one specific target in a video sequence captured by a single camera. Research in this area revolves around two basic problems. The first is target appearance modeling, also known as target matching: a corresponding appearance model is built from the target's appearance feature data, and this is the most important module of the algorithm. The quality of the appearance features directly affects tracking accuracy and robustness; commonly used features include contours, colors and textures. The second is the tracking strategy. Directly matching all content in the scene to search for the optimal position inevitably introduces a large amount of redundant information and leads to drawbacks such as heavy computation and low speed. Narrowing the search range with prior knowledge is an effective remedy; typical methods include hidden Markov models, Kalman filtering, the mean-shift algorithm and particle filtering.
Target tracking algorithms can be divided into two categories: discriminative tracking and generative tracking. A generative tracking algorithm models the target directly without considering background information: a model is learned to represent the target and is then matched directly against candidates to achieve tracking. A discriminative method models tracking as a binary classification problem, seeking a decision boundary that separates the target object from the background and maximally distinguishes target regions from non-target regions. In recent years deep learning has rapidly become a research hotspot and has achieved good results in the field of computer vision. Deep learning based on twin (Siamese) neural networks plays a significant role in single-target tracking; SiamFC is a typical application of a twin network to single-target tracking. Specifically, the structure has two inputs: a template that serves as the reference and a candidate sample to be evaluated. In the single-target tracking task the reference template is the object to be tracked, usually the target object selected in the first frame of the video sequence; the candidate sample is the image search region in each subsequent frame, and the twin network must find, in each subsequent frame, the candidate region most similar to the first-frame template, i.e. the target in that frame, thereby tracking a single target. Deep learning methods have markedly improved tracker speed and accuracy. SiamFC and other twin-network-based methods can meet real-time requirements on high-performance hardware, but they do not consider the target's historical track information during tracking; when an object identical to the target appears nearby in the scene, the target is easily lost and the accuracy of the tracking algorithm drops.
Disclosure of Invention
The invention solves the technical problems that: the method for tracking the single target by combining the historical track information is provided to solve the problem that the tracking target is lost when the same or similar targets appear in the scene.
The technical scheme for solving the technical problems is as follows: a video image single-target tracking method combined with historical track information comprises the following steps:
s1, acquiring a template image and a current frame search image;
S2, respectively sending the template image and the current frame search image into a trained convolutional neural network feature extraction layer to obtain a template image feature map and a search image feature map;
S3, sequentially sending the template image feature map and the search image feature map into a trained convolutional neural network classification layer and a trained regression layer to obtain a classification feature map and a regression feature map of the template image and a classification feature map and a regression feature map of the search image;
s4, performing cross-correlation operation on the classification characteristic image of the template image and the classification characteristic image of the search image to obtain a classification layer response image of the template image and the search image; performing cross-correlation operation on the regression feature map of the template image and the regression feature map of the search image to obtain a regression layer response map of the template image and the search image;
s5, carrying out maximum pooling operation on the classifying layer response graphs of the template image and the search image;
S6, take the top N feature points, ordered from high to low response value, from the pooled classification-layer response map, compute the regression-layer output corresponding to these N feature points, and obtain N predicted coordinate values of the target in the current frame search image from the regression-layer output;
S7, if the current frame is among the first M frames of the video, record the predicted coordinate value corresponding to the maximum response value in the classification-layer response map as the final predicted coordinate value of the target in the current frame search image; if the current frame is the M-th frame or later, proceed to step S8;
S8, from the N predicted coordinate values, find the one closest to the target's predicted coordinate in the previous frame search image and to the target's historical track over the previous M frame search images, and take it as the final predicted coordinate value of the target in the current frame search image, where M and N are both greater than or equal to 2.
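To make the flow of steps S1 to S8 concrete, the following outline sketches one possible per-frame tracking loop in Python (PyTorch style). All helper names (backbone, cls_head, reg_head, xcorr, decode_candidate, select_by_history) are placeholders for the layers and operations described in this document, not names taken from the patent, and the defaults M=5 and N=4 follow the embodiment given later; this is a sketch of one reading, not the patent's exact implementation.

```python
import torch
import torch.nn.functional as F

def track_frame(template_img, search_img, backbone, cls_head, reg_head,
                xcorr, decode_candidate, select_by_history, history,
                M=5, N=4):
    """One possible reading of steps S1-S8; every helper passed in is assumed."""
    # S2: shared-weight feature extraction for the template and search branches
    z_feat = backbone(template_img)
    x_feat = backbone(search_img)

    # S3: classification and regression feature maps for both branches
    z_cls, x_cls = cls_head(z_feat), cls_head(x_feat)
    z_reg, x_reg = reg_head(z_feat), reg_head(x_feat)

    # S4: cross-correlation -> classification and regression response maps
    cls_resp = xcorr(z_cls, x_cls)
    reg_resp = xcorr(z_reg, x_reg)

    # S5: 3x3 max pooling that keeps the response-map size (stride/padding assumed)
    cls_resp = F.max_pool2d(cls_resp, kernel_size=3, stride=1, padding=1)

    # S6: top-N response points and their regression outputs -> N candidate coordinates
    top_idx = torch.topk(cls_resp.flatten(), N).indices
    candidates = [decode_candidate(reg_resp, idx) for idx in top_idx]

    # S7 / S8: first M frames use the peak response, later frames use the history
    if len(history) < M:
        pred = candidates[0]          # top-1 corresponds to the maximum response
    else:
        pred = select_by_history(candidates, history)
    history.append(pred)
    return pred
```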
Preferably, the cross-correlation operation in step S4 is as follows:
F(z,x)=z*x+b
where b is a bias term, z is the classification-layer feature map or regression-layer feature map of the template image, x is the classification-layer feature map or regression-layer feature map of the search image, and F is the corresponding classification-layer or regression-layer response map of the template image and the search image.
Preferably, the trained convolutional neural network feature extraction layer is an Alexnet network.
Preferably, the dimensions of the feature map before and after the pooling operation in step S5 are the same.
Preferably, the specific steps of the step S8 are as follows:
S8.1, acquire the historical track coordinates {(x_i, y_i), i = 1~M} of the target in the previous M frame search images, where (x_i, y_i) denotes the predicted coordinate value of the target in the i-th frame search image before the current frame;
S8.2, compute the historical track direction information of the target, comprising the direction information o_i from the (i+1)-th-frame target position to the i-th-frame target position before the current frame, i = 1~M;
S8.3, obtain the N predicted coordinate values (a_j, b_j), j = 1~N;
S8.4, compute the deviation between each predicted coordinate value and the predicted coordinate of the target in the previous frame search image:
d_j = (a_j - x_1, b_j - y_1), j = 1~N;
S8.5, calculating the similarity between each predicted coordinate value and the target historical track;
S8.6, selecting the predicted coordinate point corresponding to the smallest S_j as the final output.
Preferably, the similarity between the j-th predicted coordinate value and the target historical track in step S8.5 is computed as:
S_j = s_{j,1} + s_{j,2}
where s_{j,1} is the first component of s_j and s_{j,2} is the second component of s_j; λ is a weight parameter, typically set to 1.
Preferably, the classification layer adopts a binary cross entropy function as a loss function during training.
Preferably, the regression layer uses the smooth L1 loss as its loss function during training.
Compared with the prior art, the invention has the beneficial effects that:
(1) By taking into account both the historical track information and the distance of the current prediction, the invention can better detect and locate the target when similar objects appear in the picture, improving target tracking accuracy.
(2) The method has a certain robustness when the tracked target is occluded.
Drawings
FIG. 1 is a diagram of a network architecture of an embodiment of the present invention;
FIG. 2 is a flow chart of the single target tracking of the present invention in combination with historical track information.
Detailed Description
The single-target tracking method combining the historical track information provided by the invention is further described below with reference to the accompanying drawings and the detailed description. Advantages and features of the invention will become more apparent from the following description and from the claims.
As shown in fig. 1 and 2, the present invention provides a video image single-target tracking method in combination with historical track information, which includes the following steps:
s1, acquiring a template image and a current frame search image;
S2, respectively sending the template image and the current frame search image into a trained convolutional neural network feature extraction layer to obtain a template image feature map and a search image feature map;
The trained convolutional neural network feature extraction layer is an Alexnet network.
S3, sequentially sending the template image feature map and the search image feature map into a trained convolutional neural network classification layer and a trained regression layer to obtain a classification feature map and a regression feature map of the template image and a classification feature map and a regression feature map of the search image;
s4, performing cross-correlation operation on the classification characteristic image of the template image and the classification characteristic image of the search image to obtain a classification layer response image of the template image and the search image; performing cross-correlation operation on the regression feature map of the template image and the regression feature map of the search image to obtain a regression layer response map of the template image and the search image;
The cross-correlation operation is as follows:
F(z,x)=z*x+b
where b is a bias term, z is the classification-layer feature map or regression-layer feature map of the template image, x is the classification-layer feature map or regression-layer feature map of the search image, and F is the corresponding classification-layer or regression-layer response map of the template image and the search image.
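The cross-correlation F(z, x) = z * x + b can be realized by treating the template feature map as a convolution kernel that slides over the search feature map. Below is a minimal PyTorch sketch of this reading; the scalar bias and the example tensor sizes (a 6×6 template map over a 22×22 search map, which yields a 17×17 response) are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def xcorr(z, x, b=0.0):
    """Cross-correlation F(z, x) = z * x + b.

    z: template feature map, shape (1, C, Hz, Wz), used as the kernel.
    x: search feature map,   shape (1, C, Hx, Wx).
    Returns a response map of shape (1, 1, Hx - Hz + 1, Wx - Wz + 1).
    """
    return F.conv2d(x, z, stride=1, padding=0) + b

# Example sizes (assumed): a 6x6x128 template map over a 22x22x128 search map
z = torch.randn(1, 128, 6, 6)
x = torch.randn(1, 128, 22, 22)
print(xcorr(z, x).shape)  # torch.Size([1, 1, 17, 17])
```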
S5, carrying out maximum pooling operation on the classifying layer response graphs of the template image and the search image; the dimension of the feature graphs is consistent before and after the pooling operation;
S6, take the top N feature points by response value from the pooled classification-layer response map, compute the regression-layer output corresponding to these N feature points, and obtain N predicted coordinate values of the target in the current frame search image from the regression-layer output;
S7, if the current frame is among the first M frames of the video, record the predicted coordinate value corresponding to the maximum response value in the classification-layer response map as the final predicted coordinate value of the target in the current frame search image; if the current frame is the M-th frame or later, proceed to step S8;
S8, from the N predicted coordinate values, find the one closest to the target's predicted coordinate in the previous frame search image and to the target's historical track over the previous M frame search images, and take it as the final predicted coordinate value of the target in the current frame search image, where M and N are both greater than or equal to 2.
The specific steps of step S8 are as follows:
S8.1, acquire the historical track coordinates {(x_i, y_i), i = 1~M} of the target in the previous M frame search images, where (x_i, y_i) denotes the predicted coordinate value of the target in the i-th frame search image before the current frame;
S8.2, compute the historical track direction information of the target, comprising the direction information o_i from the (i+1)-th-frame target position to the i-th-frame target position before the current frame, i = 1~M;
S8.3, obtain the N predicted coordinate values (a_j, b_j), j = 1~N;
S8.4, compute the deviation between each predicted coordinate value and the predicted coordinate of the target in the previous frame search image:
d_j = (a_j - x_1, b_j - y_1), j = 1~N;
S8.5, compute the similarity between each predicted coordinate value and the target's historical track; the similarity between the j-th predicted coordinate value and the target historical track is computed as:
S_j = s_{j,1} + s_{j,2}
where s_{j,1} is the first component of s_j and s_{j,2} is the second component of s_j; λ is a weight parameter, typically set to 1.
S8.6, select the predicted coordinate point corresponding to the smallest S_j as the final output.
Examples:
In a specific embodiment of the invention, Alexnet, a network widely used in image classification, is taken as the backbone to build a Siamese convolutional neural network comprising a feature extraction layer, a classification layer and a regression layer. The Siamese convolutional neural network model is trained using the public single-target tracking dataset ILSVRC together with 800 self-captured and annotated videos as training data. The key points of the model training process are as follows:
Key point 1: perform size normalization and data augmentation on the images in the video.
A target box (x_min, y_min, w, h) is obtained from the first frame of the video, where x_min and y_min are the coordinates of the upper-left corner of the ground-truth box and w and h are its width and height. Then, for each frame, a 127×127 patch centered on the center of the target box is cropped as the template image and a 255×255 patch is cropped as the search image. If the template or search patch extends beyond the original image, the missing part is filled with the per-channel mean of the RGB channels.
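As a hedged illustration of this cropping rule, the sketch below crops a square patch centered on the target box and fills any out-of-image region with the per-channel mean; the patent does not spell out a context margin, so the patch is taken directly at the requested output size, and the helper name is an assumption.

```python
import numpy as np

def crop_with_mean_pad(image, center_xy, out_size):
    """Crop an out_size x out_size patch centered at center_xy from an HxWx3 image;
    pixels outside the image are filled with the per-channel mean."""
    h, w, _ = image.shape
    cx, cy = int(round(center_xy[0])), int(round(center_xy[1]))
    half = out_size // 2
    x0, y0 = cx - half, cy - half
    x1, y1 = x0 + out_size, y0 + out_size

    # start from a patch filled with the channel means, then paste the overlap
    patch = np.tile(image.reshape(-1, 3).mean(axis=0), (out_size, out_size, 1))
    sx0, sy0 = max(x0, 0), max(y0, 0)
    sx1, sy1 = min(x1, w), min(y1, h)
    if sx1 > sx0 and sy1 > sy0:
        patch[sy0 - y0:sy1 - y0, sx0 - x0:sx1 - x0] = image[sy0:sy1, sx0:sx1]
    return patch.astype(image.dtype)

# center = (x_min + w_box / 2, y_min + h_box / 2) from the first-frame target box
# template = crop_with_mean_pad(frame, center, 127)   # 127x127 template image
# search   = crop_with_mean_pad(frame, center, 255)   # 255x255 search image
```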
Data augmentation operations applied to the cropped images include rotation, added noise, color jitter and the like.
Key point 2: building the network model.
Referring to fig. 2, the network structure used in the present invention includes a feature extraction layer, a classification layer, and a regression layer.
The single-target tracking network has two identical feature extraction layers that share parameters, i.e. the network is divided into a search branch and a template branch. The template branch takes the template image as input, e.g. a 127×127×3 template image, where 127×127 is the input resolution and 3 is the number of channels (typically an RGB image). The search branch takes the search image as input, e.g. a 255×255×3 image.
The two branch networks of the feature extraction layer are both Alexnet-based convolutional neural networks with identical structures and parameters, comprising, connected in sequence, a first convolutional layer Conv1, a first pooling layer Pool1, a second pooling layer Pool2, a third convolutional layer Conv3, a fourth convolutional layer Conv4 and a fifth convolutional layer Conv5. The specific parameters are as follows: Conv1 has an 11×11 convolution kernel, stride 2 and 96 output channels; Pool1 has a 3×3 kernel, stride 2 and 96 output channels; Pool2 has a 3×3 kernel, stride 2 and 256 output channels; Conv3 and Conv4 both have 3×3 convolution kernels, stride 1 and 192 output channels; Conv5 has a 3×3 convolution kernel, stride 1 and 128 output channels.
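The layer list above does not name the convolution that normally sits between Pool1 and Pool2 in an Alexnet-style backbone (the 256-channel stage implied by Pool2's output channels), so the sketch below inserts an assumed 5×5 convolution there; the remaining kernel sizes, strides and channel counts follow the parameters listed above. It is a reading of the description, not the patent's exact network.

```python
import torch.nn as nn

class AlexnetBackbone(nn.Module):
    """Feature-extraction branch following the parameters listed above.
    The Conv2 stage (5x5, 256 channels) between Pool1 and Pool2 is an assumption."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=2),    # Conv1: 11x11, stride 2, 96 ch
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),          # Pool1: 3x3, stride 2
            nn.Conv2d(96, 256, kernel_size=5, stride=1),    # assumed Conv2: 5x5, 256 ch
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),          # Pool2: 3x3, stride 2
            nn.Conv2d(256, 192, kernel_size=3, stride=1),   # Conv3: 3x3, stride 1, 192 ch
            nn.ReLU(inplace=True),
            nn.Conv2d(192, 192, kernel_size=3, stride=1),   # Conv4: 3x3, stride 1, 192 ch
            nn.ReLU(inplace=True),
            nn.Conv2d(192, 128, kernel_size=3, stride=1),   # Conv5: 3x3, stride 1, 128 ch
        )

    def forward(self, x):
        # a 127x127x3 template input yields a 6x6x128 feature map with these settings
        return self.features(x)
```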
In the classification layer, a convolution with a 3×3 kernel and 256 output channels is applied first, followed by a convolution with a 1×1 kernel and 128 output channels.
In the regression layer, a convolution with a 3×3 kernel and 256 output channels is likewise applied first, followed by a convolution with a 1×1 kernel and 128 output channels.
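A possible reading of these two heads, applied with the same structure to both the template branch and the search branch, is sketched below; the intermediate ReLU and the padding value (chosen so the head preserves the spatial size of the backbone output) are assumptions.

```python
import torch.nn as nn

def make_head(in_channels=128):
    """3x3 convolution (256 channels) followed by a 1x1 convolution (128 channels)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),  # padding=1 is assumed
        nn.ReLU(inplace=True),
        nn.Conv2d(256, 128, kernel_size=1),
    )

cls_head = make_head()  # classification layer, applied to both branches
reg_head = make_head()  # regression layer, same structure with separate weights
```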
The correlation operation of the classification branch is as follows: taking a 127×127×3 template input and a 255×255×3 search input as an example, a 6×6×128 template classification feature map and a 23×23×128 search classification feature map are obtained; the 6×6×128 map is then used as the convolution kernel and the 23×23×128 map as the input feature map, and a convolution with stride s=1 and pad=0 outputs a 17×17×1 classification-layer response map.
The correlation operation of the regression branch is as follows: with the same 127×127×3 template and 255×255×3 search inputs, a 6×6×128 template regression feature map and a 23×23×128 search regression feature map are obtained; the 6×6×128 map is used as the convolution kernel and the 23×23×128 map as the input feature map, and a convolution with stride s=1 and pad=0 outputs a 17×17×1 feature map. Finally a 1×1 convolution with 4 output channels produces a 17×17×4 regression-layer response map.
Key point 3: the loss function.
In the classification layer, the invention uses the binary cross-entropy function as the loss function. When assigning positive and negative samples, the sample points that fall inside the ground-truth target box when the classification map is mapped back to the original image are set as positive samples, and the others are set as negative samples.
The regression layer outputs a 17×17×4 feature map in which the regression values of each sample represent its distances to the target box. The regression loss uses the smooth L1 loss function.
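The text states only that the four regression channels represent distances to the target box. One common encoding, and the assumption used in the sketch below, is the distance from the sample point to the left, top, right and bottom sides of the box; the decoding helper is illustrative only.

```python
def decode_box(point_xy, reg_values):
    """Decode one 4-channel regression value into a box, assuming the channels are
    the distances (l, t, r, b) from the sample point to the four box sides."""
    px, py = point_xy
    l, t, r, b = reg_values
    x_min, y_min = px - l, py - t
    x_max, y_max = px + r, py + b
    center = ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)  # predicted coordinate
    return (x_min, y_min, x_max, y_max), center
```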
The final loss is as follows:
loss = φ_cls + λ_2 φ_reg
where loss is the sum of the classification loss and the regression loss, and λ_2 is a hyper-parameter, set to 0.5, that controls the weight of the regression loss.
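A minimal PyTorch sketch of this combined loss follows, with λ_2 = 0.5 as stated above; restricting the smooth L1 term to positive samples is an assumption, since the text does not spell out the masking.

```python
import torch.nn.functional as F

def tracking_loss(cls_pred, cls_target, reg_pred, reg_target, pos_mask, lambda2=0.5):
    """loss = phi_cls + lambda2 * phi_reg.

    cls_pred:   raw classification scores, shape (B, 1, H, W)
    cls_target: 0/1 labels (1 = point falls inside the ground-truth box)
    reg_pred, reg_target: 4-channel regression maps, shape (B, 4, H, W)
    pos_mask:   boolean map of positive samples, shape (B, 1, H, W) (assumed)
    """
    phi_cls = F.binary_cross_entropy_with_logits(cls_pred, cls_target.float())
    pos = pos_mask.expand_as(reg_pred)
    phi_reg = F.smooth_l1_loss(reg_pred[pos], reg_target[pos])
    return phi_cls + lambda2 * phi_reg
```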
In this embodiment, after the feature extraction layer, the classification layer and the regression layer are established, step S5 of the video image single-target tracking method provided by the invention uses a 3×3 max-pooling layer.
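For step S5 the response map must keep its size through the 3×3 max pooling; with stride 1 and padding 1 this holds, as the short check below illustrates (the stride and padding values are assumptions consistent with that requirement).

```python
import torch
import torch.nn.functional as F

resp = torch.randn(1, 1, 17, 17)   # classification-layer response map
pooled = F.max_pool2d(resp, kernel_size=3, stride=1, padding=1)
print(pooled.shape)                # torch.Size([1, 1, 17, 17]), size unchanged
```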
If the currently processed frame is among the first 5 frames, the target position is computed from the maximum response point of the classification layer and the current predicted target position is recorded. When the processed frame index is greater than 5, the new target position is predicted in combination with the historical track information, as follows:
The 4 largest response points of the classification layer are taken and the regression-layer outputs corresponding to these four values are computed, giving four different predicted coordinates. From these four predicted coordinates, the one closest to the previous-frame tracking result and the historical track is computed and used as the final output.
In step S6, the top 4 feature points by response value are taken from the pooled classification-layer response map, and the coordinates in the regression-layer response map corresponding to these 4 feature points are computed, giving 4 predicted coordinate values of the target in the current frame search image;
in step S7, if the current frame is among the first 5 frames of the video, the predicted coordinate value corresponding to the maximum response value in the classification-layer response map is recorded as the final predicted coordinate value of the target in the current frame search image; if the current frame is the 5th frame or later, proceed to step S8;
in step S8, from the 4 predicted coordinate values, the one closest to the target's predicted coordinate in the previous frame search image and to the target's historical track over the previous 5 frame search images is found and used as the final predicted coordinate value of the target in the current frame search image.
The specific steps of step S8 are as follows:
S8.1, acquire the historical track coordinates of the target in the previous M frame search images,
{(x_5, y_5), (x_4, y_4), (x_3, y_3), (x_2, y_2), (x_1, y_1)}, where (x_i, y_i) denotes the predicted coordinate value of the target in the i-th frame search image before the current frame;
S8.2, compute the historical track direction information of the target, comprising the direction information o_i from the (i+1)-th-frame target position to the i-th-frame target position before the current frame;
taking M equal to 5 as an example, specifically:
o_4 = (x_4 - x_5, y_4 - y_5)
o_3 = (x_3 - x_4, y_3 - y_4)
o_2 = (x_2 - x_3, y_2 - y_3)
o_1 = (x_1 - x_2, y_1 - y_2)
S8.3, the 4 predicted coordinate values (a_j, b_j), j = 1~4, are obtained;
S8.4, compute the deviation between each predicted coordinate value and the predicted coordinate of the target in the previous frame search image:
d_j = (a_j - x_1, b_j - y_1), j = 1~4;
S8.5, compute the similarity between each predicted coordinate value and the target's historical track; the similarity between the j-th predicted coordinate value and the target historical track is computed as:
S_j = s_{j,1} + s_{j,2}
where s_{j,1} is the first component of s_j and s_{j,2} is the second component of s_j; λ is a weight parameter, typically set to 1.
S8.6, select the predicted coordinate point corresponding to the smallest S_j as the final output.
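The formulas defining s_{j,1} and s_{j,2} appear as images in the original publication and are not reproduced in this text, so the sketch below uses one plausible reading: s_{j,1} penalizes the magnitude of the deviation d_j from the previous position, and s_{j,2}, weighted by λ, penalizes the angular disagreement between d_j and the averaged historical direction. Only d_j, o_i, S_j = s_{j,1} + s_{j,2} and the minimum-S_j selection come from the text; everything else is an assumption.

```python
import math

def select_by_history(candidates, history, lam=1.0):
    """Pick the candidate most consistent with the previous position and the
    historical track (steps S8.1-S8.6); s_j,1 and s_j,2 below are assumed."""
    # S8.1: history = [(x_1, y_1), ..., (x_M, y_M)], most recent frame first
    x1, y1 = history[0]

    # S8.2: direction vectors o_i from the (i+1)-th to the i-th previous position
    dirs = [(history[i][0] - history[i + 1][0], history[i][1] - history[i + 1][1])
            for i in range(len(history) - 1)]
    mean_dir = (sum(d[0] for d in dirs) / len(dirs),
                sum(d[1] for d in dirs) / len(dirs)) if dirs else (0.0, 0.0)

    best, best_score = None, float("inf")
    for aj, bj in candidates:                    # S8.3: the N candidate coordinates
        dj = (aj - x1, bj - y1)                  # S8.4: deviation from the previous frame
        s1 = math.hypot(dj[0], dj[1])            # assumed: distance term
        s2 = _angle_between(dj, mean_dir)        # assumed: direction disagreement
        score = s1 + lam * s2                    # S8.5: combined score S_j
        if score < best_score:                   # S8.6: smallest S_j wins
            best, best_score = (aj, bj), score
    return best

def _angle_between(u, v):
    """Angle in radians between two 2-D vectors; 0 if either has zero length."""
    nu, nv = math.hypot(u[0], u[1]), math.hypot(v[0], v[1])
    if nu == 0.0 or nv == 0.0:
        return 0.0
    cosang = max(-1.0, min(1.0, (u[0] * v[0] + u[1] * v[1]) / (nu * nv)))
    return math.acos(cosang)
```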
Although the invention has been described in terms of the preferred embodiments, it is not limited to them. Any person skilled in the art can make possible variations and modifications to the technical solution of the invention using the methods and technical content disclosed above without departing from the spirit and scope of the invention; therefore, any simple modifications, equivalent variations and modifications made to the above embodiments according to the technical substance of the invention fall within the protection scope of the technical solution of the invention.

Claims (6)

1. A video image single-target tracking method combining historical track information is characterized by comprising the following steps:
s1, acquiring a template image and a current frame search image;
S2, respectively sending the template image and the current frame search image into a trained convolutional neural network feature extraction layer to obtain a template image feature map and a search image feature map;
S3, sequentially sending the template image feature map and the search image feature map into a trained convolutional neural network classification layer and a trained regression layer to obtain a classification feature map and a regression feature map of the template image and a classification feature map and a regression feature map of the search image;
s4, performing cross-correlation operation on the classification characteristic image of the template image and the classification characteristic image of the search image to obtain a classification layer response image of the template image and the search image; performing cross-correlation operation on the regression feature map of the template image and the regression feature map of the search image to obtain a regression layer response map of the template image and the search image;
s5, carrying out maximum pooling operation on the classifying layer response graphs of the template image and the search image;
S6, taking out the top N feature points, ordered from high to low response value, from the pooled classification-layer response map, computing the regression-layer output corresponding to these N feature points, and obtaining N predicted coordinate values of the target in the current frame search image from the regression-layer output;
S7, if the current frame is among the first M frames of the video image, recording the predicted coordinate value corresponding to the maximum response value in the classification-layer response map as the final predicted coordinate value of the target in the current frame search image; if the current frame is the M-th frame or a later frame, proceeding to step S8;
S8, finding, among the N predicted coordinate values, the predicted coordinate value closest to the target's predicted coordinate in the previous frame search image and to the target's historical track in the previous M frame search images, and taking it as the final predicted coordinate value of the target in the current frame search image, wherein M and N are both greater than or equal to 2;
the specific steps of the step S8 are as follows:
S8.1, acquiring the historical track coordinates {(x_i, y_i), i = 1~M} of the target in the previous M frame search images, wherein (x_i, y_i) denotes the predicted coordinate value of the target in the i-th frame search image before the current frame;
S8.2, calculating the historical track direction information of the target, which comprises the direction information o_i from the (i+1)-th-frame target position to the i-th-frame target position before the current frame, i = 1~M;
S8.3, obtaining the N predicted coordinate values (a_j, b_j), j = 1~N;
S8.4, calculating the deviation between each predicted coordinate value and the predicted coordinate of the target in the previous frame search image:
d_j = (a_j - x_1, b_j - y_1), j = 1~N;
S8.5, calculating the similarity between each predicted coordinate value and the target historical track;
the similarity between the j-th predicted coordinate value and the target historical track is calculated as:
S_j = s_{j,1} + s_{j,2}
wherein s_{j,1} is the first component of s_j and s_{j,2} is the second component of s_j; λ is a weight parameter and is set to 1;
S8.6, selecting the predicted coordinate point corresponding to the smallest S_j as the final output.
2. The single-object tracking method in combination with historical track information according to claim 1, wherein the cross-correlation operation in step S4 is as follows:
F(z,x)=z*x+b
wherein b is a bias term, z is the classification-layer feature map or regression-layer feature map of the template image, x is the classification-layer feature map or regression-layer feature map of the search image, and F is the corresponding classification-layer or regression-layer response map of the template image and the search image.
3. The method for single-target tracking in combination with historical track information according to claim 1, wherein the trained convolutional neural network feature extraction layer is an Alexnet network.
4. The single-object tracking method according to claim 1, wherein the step S5 is performed with consistent dimensions of feature maps before and after the pooling operation.
5. The method for single-object tracking in combination with historical track information according to claim 1, wherein the classification layer uses a binary cross entropy function as a loss function during training.
6. The method of claim 1, wherein the regression layer uses the smooth L1 loss as a loss function during training.
CN202111221441.0A 2021-10-20 2021-10-20 Video image single-target tracking method combining historical track information Active CN114155273B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111221441.0A CN114155273B (en) 2021-10-20 2021-10-20 Video image single-target tracking method combining historical track information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111221441.0A CN114155273B (en) 2021-10-20 2021-10-20 Video image single-target tracking method combining historical track information

Publications (2)

Publication Number Publication Date
CN114155273A CN114155273A (en) 2022-03-08
CN114155273B (en) 2024-06-04

Family

ID=80462833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111221441.0A Active CN114155273B (en) 2021-10-20 2021-10-20 Video image single-target tracking method combining historical track information

Country Status (1)

Country Link
CN (1) CN114155273B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115187969B (en) * 2022-09-14 2022-12-09 河南工学院 Lead-acid battery recovery system and method based on visual identification

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110517292A (en) * 2019-08-29 2019-11-29 京东方科技集团股份有限公司 Method for tracking target, device, system and computer readable storage medium
CN110675432A (en) * 2019-10-11 2020-01-10 智慧视通(杭州)科技发展有限公司 Multi-dimensional feature fusion-based video multi-target tracking method
CN111797716A (en) * 2020-06-16 2020-10-20 电子科技大学 Single target tracking method based on Siamese network
CN111860352A (en) * 2020-07-23 2020-10-30 上海高重信息科技有限公司 Multi-lens vehicle track full-tracking system and method
CN113506317A (en) * 2021-06-07 2021-10-15 北京百卓网络技术有限公司 Multi-target tracking method based on Mask R-CNN and apparent feature fusion

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110517292A (en) * 2019-08-29 2019-11-29 京东方科技集团股份有限公司 Method for tracking target, device, system and computer readable storage medium
CN110675432A (en) * 2019-10-11 2020-01-10 智慧视通(杭州)科技发展有限公司 Multi-dimensional feature fusion-based video multi-target tracking method
CN111797716A (en) * 2020-06-16 2020-10-20 电子科技大学 Single target tracking method based on Siamese network
CN111860352A (en) * 2020-07-23 2020-10-30 上海高重信息科技有限公司 Multi-lens vehicle track full-tracking system and method
CN113506317A (en) * 2021-06-07 2021-10-15 北京百卓网络技术有限公司 Multi-target tracking method based on Mask R-CNN and apparent feature fusion

Also Published As

Publication number Publication date
CN114155273A (en) 2022-03-08

Similar Documents

Publication Publication Date Title
CN108197587B (en) Method for performing multi-mode face recognition through face depth prediction
CN109344701B (en) Kinect-based dynamic gesture recognition method
CN112270249A (en) Target pose estimation method fusing RGB-D visual features
CN111161317A (en) Single-target tracking method based on multiple networks
CN108388882B (en) Gesture recognition method based on global-local RGB-D multi-mode
CN110110755B (en) Pedestrian re-identification detection method and device based on PTGAN region difference and multiple branches
CN111062263B (en) Method, apparatus, computer apparatus and storage medium for hand gesture estimation
CN108470178B (en) Depth map significance detection method combined with depth credibility evaluation factor
CN113963032A (en) Twin network structure target tracking method fusing target re-identification
CN110006444B (en) Anti-interference visual odometer construction method based on optimized Gaussian mixture model
CN112927264B (en) Unmanned aerial vehicle tracking shooting system and RGBD tracking method thereof
CN110992378B (en) Dynamic updating vision tracking aerial photographing method and system based on rotor flying robot
CN113592894B (en) Image segmentation method based on boundary box and co-occurrence feature prediction
CN111882581B (en) Multi-target tracking method for depth feature association
CN113850136A (en) Yolov5 and BCNN-based vehicle orientation identification method and system
CN106529441A (en) Fuzzy boundary fragmentation-based depth motion map human body action recognition method
CN114120389A (en) Network training and video frame processing method, device, equipment and storage medium
CN114155273B (en) Video image single-target tracking method combining historical track information
CN111814705A (en) Pedestrian re-identification method based on batch blocking shielding network
CN110688512A (en) Pedestrian image search algorithm based on PTGAN region gap and depth neural network
CN113255549B (en) Intelligent recognition method and system for behavior state of wolf-swarm hunting
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN112001954B (en) Underwater PCA-SIFT image matching method based on polar curve constraint
CN106650814B (en) Outdoor road self-adaptive classifier generation method based on vehicle-mounted monocular vision
CN117133032A (en) Personnel identification and positioning method based on RGB-D image under face shielding condition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant