CN112116626A - Single-target tracking method based on flexible convolution - Google Patents
- Publication number: CN112116626A (application CN202010773674.0A)
- Authority
- CN
- China
- Prior art keywords
- flexible
- layer
- target
- convolution
- network model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T7/20 — Image analysis; analysis of motion
- G06F18/2431 — Pattern recognition; classification techniques, multiple classes
- G06N3/045 — Neural networks; combinations of networks
- G06N3/047 — Neural networks; probabilistic or stochastic networks
- G06N3/08 — Neural networks; learning methods
- G06T7/11 — Image analysis; region-based segmentation
- G06T2207/10016 — Image acquisition modality: video; image sequence
- G06T2207/20132 — Special algorithmic details: image cropping
- G06T2207/30204 — Subject of image: marker
Abstract
An embodiment of the invention provides a single-target tracking method based on flexible convolution. A flexible convolutional network model comprising a shared layer and a domain-specific layer is constructed and trained on a data set, and the method comprises the following steps: S1, acquiring an original video sequence and preprocessing it; S2, inputting the preprocessed video sequence into the flexible convolutional network model, where the shared layer extracts shared features of the target through convolution operations, the shared features are passed to the domain-specific layer for binary classification of target versus background, flexible RoI pooling is then performed to select a candidate target region, and a loss function is used to improve the precision of the candidate target region, thereby realizing single-target tracking. The method effectively alleviates the problem that objects readily deform during single-target tracking, while flexible RoI pooling improves the accuracy of the candidate target region.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a single-target tracking method based on flexible convolution.
Background
Single-target tracking refers to identifying and locating an arbitrary given target, specified in an initial frame, throughout a video sequence. It has long been a research hotspot in the field of computer vision and is widely applied in video surveillance, autonomous driving, human-computer interaction, and other fields.
Because an object readily deforms during motion (e.g., scale change, rotation, posture change), the single-target tracking methods of the prior art handle this poorly, so the tracking effect suffers. For example, conventional deep learning extracts features with standard convolution, which is regular and of fixed geometric size (e.g., 3 × 3 or 5 × 5), so the sampled region also has a fixed geometric shape. Conventional single-target tracking algorithms extract features with such standard convolution operations and then perform tracking with a corresponding tracking model; the MDNet single-target tracking algorithm, for instance, realizes tracking with the conventional convolution method. Because a conventional convolutional neural network applies the same convolution to different feature maps and the sampled pixel positions are fixed, the sampled information includes many background features and cannot adapt to the shape of the object.
Disclosure of Invention
The embodiment of the invention provides a single-target tracking method based on flexible convolution, which is used for overcoming the defects of the prior art.
In order to achieve the purpose, the invention adopts the following technical scheme.
A single target tracking method based on flexible convolution constructs a flexible convolution network model, wherein the flexible convolution network model comprises a sharing layer and a specific domain layer, and is trained by utilizing a data set, and the method comprises the following steps:
s1, acquiring an original video sequence and preprocessing the original video sequence;
s2, inputting the preprocessed video sequence into a flexible convolution network model, enabling the sharing layer to obtain the sharing characteristics of the target through convolution operation, inputting the sharing characteristics into a specific domain layer to perform two classifications of the target and the background, then performing flexible RoI pooling to select a candidate target area, and improving the precision of the candidate target area by using a loss function, thereby realizing single-target tracking.
Preferably, the flexible convolutional network model comprises a shared layer and a domain-specific layer, wherein the shared layer comprises 3 convolutional layers conv1–3 and 2 fully connected layers fc4–5, each fully connected layer has 512 output units, and ReLU and pooling layers lie between adjacent convolutional layers and between the two fully connected layers;

the domain-specific layer consists of the fully connected layers fc6^1–fc6^K of the flexible convolutional network model, corresponding to K domains; each domain contains a binary classification layer with a softmax cross-entropy function, responsible for distinguishing target from background within that domain.
Preferably, the softmax function inside the cross-entropy is

S(x_{i1}) = e^{x_{i1}} / Σ_{j1} e^{x_{j1}},

where i1 indexes the input, j1 runs over the number of inputs, and e is the base of the natural logarithm (≈ 2.718).
Preferably, the shared layer obtains shared features of the target through a convolution operation, including:

a flexible convolution operation samples the input feature map x on a regular grid R and adds a position offset ΔP_n, with {ΔP_n | n = 1, …, N}, N = |R|. For each position P_0, the feature points at all positions of the regular grid R are weighted by the corresponding convolution-kernel weights and summed to give the point P_0 on the new feature map. Because a two-dimensional offset ΔP_n along the x and y axes is added to the original regular-grid positions, the offset ΔP_n is a floating-point value, and the sampled value is obtained by bilinear interpolation of the 4 surrounding real values;

after the flexible convolution operation, a new feature map with the same height and width as the original and 2N channels is obtained; each feature point P_0 on the new feature map carries 2N values — 2 for the offsets along the corresponding x and y axes, and N for the corresponding N ΔP_n values.
Preferably, performing flexible RoI pooling to select the candidate target region comprises:

the flexible RoI pooling layer divides a w × h RoI into k × k bins and outputs a k × k feature map y. For the (i_0, j_0)-th bin, with 0 ≤ i_0, j_0 < k, where i_0 is the i_0-th row and j_0 the j_0-th column of the candidate target region,

y(i_0, j_0) = Σ_{P ∈ bin(i_0, j_0)} x(P_0 + P + ΔP_{i_0 j_0}) / n_{i_0 j_0},

where x is the input, P ranges over the positions in the bin, and n_{i_0 j_0} is the number of pixels in the bin; y(i_0, j_0) is the output after flexible RoI pooling.
Preferably, the loss function in S2 is the binary cross-entropy

L = −(1/N) Σ_i [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ],

where ŷ_i denotes the probability that the i-th region is predicted as a positive example and y_i denotes the true label of the i-th region. The network outputs probability scores for positive and negative samples; a threshold is set, and a sample scoring above the threshold is positive, otherwise negative.
According to the technical solution provided by the embodiment of the invention, flexible convolution and flexible RoI pooling make the extracted features and the candidate target region more accurate and effective, alleviating the deformation of the tracked object during target tracking and overcoming the defects of the prior art.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic structural diagram of a single-target tracking method based on flexible convolution according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a flexible convolutional network model training process according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a testing process of the flexible convolutional network model according to the embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the convenience of understanding the embodiments of the present invention, the following description will be further explained by taking several specific embodiments as examples in conjunction with the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.
The embodiment of the invention provides a single-target tracking method based on flexible convolution, which specifically comprises the following steps as shown in figures 1-3:
and S1, acquiring the original video sequence, preprocessing the original video sequence, and cutting the picture into 107 × 107.
S2, inputting the preprocessed video sequence into the flexible convolutional network model, where the shared layer obtains shared features of the target through convolution operations, the shared features are input to the domain-specific layer for binary classification of target versus background, flexible RoI pooling is then performed to select a candidate target region, and a loss function is used to improve the precision of the candidate target region, thereby realizing single-target tracking.
Firstly, a flexible convolution network model is constructed, wherein the flexible convolution network model comprises a sharing layer and a specific domain layer, and the method specifically comprises the following steps:
based on an MDNet (Multi-Domain volumetric Neural Networks) single-target tracking network, a traditional convolution method is modified into flexible convolution, a sharing layer of a flexible convolution network model comprises five hidden layers which are respectively 3 convolution layers (conv1-3) and 2 full-connection layers (fc4-5), each full-connection layer is provided with 512 output units, and relu layers and posing layers are respectively arranged between every two adjacent convolution layers and between the two full-connection layers.
The domain-specific layer learns information for a specific domain and consists of the last fully connected layers of the flexible convolutional network model (fc6^1–fc6^K), corresponding to K domains. Each of the K domains contains a binary classification layer with a softmax cross-entropy function, responsible for distinguishing target from background within that domain, so that target features are learned more accurately. The softmax inside the cross-entropy is

S(x_{i1}) = e^{x_{i1}} / Σ_{j1} e^{x_{j1}},

where i1 indexes the input, j1 runs over the number of inputs, and e is the base of the natural logarithm (≈ 2.718).
Feature extraction adopts the flexible convolution operation. The input feature map x is sampled on a regular grid R; a conventional convolution over a 3 × 3 grid R = {(−1, −1), (−1, 0), …, (0, −1), (1, 1)} computes

y(P_0) = Σ_{P_n ∈ R} w(P_n) · x(P_0 + P_n),

where P_n is a position in the convolution kernel and w is the kernel weight. The flexible convolution adds a position offset for each position P_0:

y(P_0) = Σ_{P_n ∈ R} w(P_n) · x(P_0 + P_n + ΔP_n),

with {ΔP_n | n = 1, …, N}, N = |R|, i.e., for each position P_0 the feature points at all positions of the regular grid R are weighted by the corresponding kernel weights and summed to give the point P_0 on the new feature map. Because a two-dimensional offset ΔP_n (along the x and y axes) is added to the original regular-grid positions, the sampling position is a floating-point value, and its value must be obtained by bilinear interpolation of the 4 surrounding real values, i.e.

x(P) = Σ_q g(q_x, P_x) · g(q_y, P_y) · x(q),

where g(a, b) = max(0, 1 − |a − b|) and q enumerates integer positions on the feature map (only the 4 nearest neighbours of P contribute). After the flexible convolution operation, a new feature map with the same height and width as the original and 2N channels is obtained; each feature point P_0 on the new feature map carries 2N values — 2 for the offsets along the corresponding x and y axes, and N for the corresponding N ΔP_n values.
After the flexible convolution operation, the extracted features are fed into the fully connected layers for target/background binary classification, and finally RoI pooling selects the candidate target region to complete target tracking.
To select the candidate target region by RoI pooling, standard RoI pooling is first applied to the input feature map x; a fully connected layer then outputs normalized k × k offsets ΔP̂_{ij}, from which

ΔP_{ij} = γ · ΔP̂_{ij} ∘ (w, h)

is computed, where γ adjusts the magnitude of the offset and is empirically set to 0.1, and w and h are the width and height of the RoI. The offset ΔP_{ij} is still a floating-point value and must be obtained by bilinear interpolation of the 4 surrounding real values. Specifically:

the flexible RoI pooling layer divides the w × h RoI into k × k regions (bins) and outputs a k × k feature map y. For the (i_0, j_0)-th bin (0 ≤ i_0, j_0 < k),

y(i_0, j_0) = Σ_{P ∈ bin(i_0, j_0)} x(P_0 + P + ΔP_{i_0 j_0}) / n_{i_0 j_0},

where i_0 is the i_0-th row and j_0 the j_0-th column of the candidate target region; x is the input, P ranges over the positions in the bin, n_{i_0 j_0} is the number of pixels in the bin, and ΔP_{i_0 j_0} is the offset ({ΔP_{ij} | 0 ≤ i, j < k}); y(i_0, j_0) is the output after flexible RoI pooling.
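The bin-wise averaging with per-bin offsets can be sketched as follows (a simplified illustration: the offsets here are integers for clarity, whereas in the method they are fractional and resolved by bilinear interpolation; all names are illustrative):

```python
def flexible_roi_pool(feature, roi, k, offsets):
    # roi = (x, y, w, h); divide into k x k bins; each bin (i0, j0) is
    # average-pooled at positions shifted by its offset offsets[i0][j0].
    x, y, w, h = roi
    out = [[0.0] * k for _ in range(k)]
    for i0 in range(k):
        for j0 in range(k):
            dx, dy = offsets[i0][j0]
            total, n = 0.0, 0
            for p in range(i0 * h // k, (i0 + 1) * h // k):
                for q in range(j0 * w // k, (j0 + 1) * w // k):
                    total += feature[y + p + dy][x + q + dx]
                    n += 1
            out[i0][j0] = total / n   # divide by pixel count n of the bin
    return out

# 4x4 feature map with value = row*4 + col; whole map as the RoI, k = 2
feature = [[r * 4 + c for c in range(4)] for r in range(4)]
zero = [[(0, 0)] * 2 for _ in range(2)]   # no learned offsets
pooled = flexible_roi_pool(feature, (0, 0, 4, 4), 2, zero)
```

With zero offsets this reduces to ordinary average RoI pooling; nonzero offsets shift each bin's sampling region toward the deforming object.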
The feature extraction of each video sequence specifically comprises:
the video sequence is RGB pictures, and each picture is expressed as after the characteristic of each picture is extracted
x=[b,H,W,C]
Wherein, b is the current blocksize of the picture, C is the number of channels of the picture, the value is 3, the number is three channels of RGB, and H and W are picture pixel values.
As shown in fig. 2, offline learning is performed before tracking starts; its purpose is to train the parameters on the current training data. The conv1, conv2 and conv3 layer parameters are not updated during online tracking, while the fc4 and fc5 layer parameters are updated during online tracking. The original training data are consecutive video frames cut from data containing multiple videos, and each frame carries a manually labeled ground-truth box (hereinafter abbreviated gt-box) indicating the position of the tracking target in the image. A box is represented by a vector (x, y, w, h), where (x, y) are the coordinates of the box center within the image, w is the box width, and h is the box height. In each frame of each video sequence, 50 boxes whose IoU with the gt-box is ≥ 0.7 are generated by uniform random sampling as positive samples, and 200 boxes with IoU ≤ 0.5 as negative samples. Training the flexible convolutional network model with this data set specifically comprises the following steps:
the first step is as follows: and (5) initializing. The parameter { w1, w 2.. w5} is the result of the MDNet model pre-training, and w6 is the result of the random initialization.
The second step: bounding-box regression training. According to the gt-box positions, 1000 boxes with IoU ≥ 0.7 relative to the gt-box are generated by uniform random sampling; the images within these 1000 boxes are scaled to obtain 1000 training samples at 107 × 107 resolution as input, and the fc_box parameters are obtained with a linear regression algorithm. After the bounding-box regression is completed, fc_box is not updated until the next tracking run.
The third step: network training. According to the gt-box positions, 500 boxes with IoU ≥ 0.7 relative to the gt-box are generated by Gaussian random sampling as positive samples, and 5000 boxes with IoU ≤ 0.3 as negative samples by uniform random sampling. The learning rate of the fc4 and fc5 layers is set to 0.0001 and that of the fc6 layer to 0.001, with 30 iterations of SGD training. Each iteration uses a mini-batch of 128: 32 randomly selected positive samples and 96 hard negative samples chosen from 1024 randomly drawn negatives; the fc4–6 layer parameters are updated after training.
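Hard negative mining, as used to assemble each mini-batch, simply keeps the negatives the network most confidently misclassifies as positive. A minimal sketch (function name and scores are illustrative):

```python
def hard_negatives(neg_scores, k=96):
    # Keep the indices of the k negative samples with the highest positive
    # scores, i.e. the negatives the network gets most wrong.
    order = sorted(range(len(neg_scores)),
                   key=lambda i: neg_scores[i], reverse=True)
    return order[:k]

# Positive scores assigned by the network to five negative samples
scores = [0.1, 0.9, 0.3, 0.7, 0.2]
picked = hard_negatives(scores, k=2)   # the two most confusing negatives
```

Training on these confusing negatives sharpens the target/background boundary faster than training on random negatives.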
After this training, back-propagation adjusts the parameters using the loss function

L = −(1/N) Σ_i [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ],

where ŷ_i denotes the probability that the i-th region is predicted as a positive example and y_i denotes the true label of the i-th region. The network outputs probability scores for positive and negative samples; a threshold is set, and a sample scoring above the threshold is positive, otherwise negative. The threshold lies between 0.5 and 1.
In the invention, the threshold is set to 0.5. For each manually annotated image sequence, the annotations are the corresponding target positions; if the overlap ratio between the predicted target region and the manual annotation is ≥ 0.5, tracking is considered successful, otherwise it is considered failed.
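The binary cross-entropy loss and the threshold rule can be sketched in a few lines (an illustrative sketch with made-up scores, not the trained network's output):

```python
import math

def bce_loss(p_hat, y):
    # L = -(1/N) * sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ]
    n = len(y)
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for pi, yi in zip(p_hat, y)) / n

def classify(p_hat, threshold=0.5):
    # Scores above the threshold are taken as positive samples
    return [1 if pi > threshold else 0 for pi in p_hat]

p_hat = [0.9, 0.2, 0.7]   # predicted probability of being the target
y     = [1,   0,   1]     # ground-truth labels
loss = bce_loss(p_hat, y)
labels = classify(p_hat)
```

The loss shrinks as the predicted probabilities move toward the true labels, which is exactly the direction back-propagation pushes the fc-layer parameters.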
The tracking of the flexible convolution network model by using the data set specifically comprises the following steps:
the fourth step: and (4) tracking on line. And generating 256 candidate boxes in a Gaussian distribution random mode according to the output of the box in the previous frame, respectively obtaining the positive score of the candidate boxes through network calculation, and selecting the box with the largest numerical value. If the positive score is greater than 0.5, the tracing is considered to be successful, and the following operations are executed: (1) adjusting the box by using the frame regression parameters to obtain a tracking result (namely generating a reference box of 256 candidate boxes in the next frame); (2) based on box after frame regression, 50 positive samples IoU of which are more than or equal to 0.7 and 200 negative samples IoU of which are more than or equal to 0.3 are generated in a Gaussian distribution random mode. If its positive score is less than 0.5, tracking is considered to have failed.
The data set of the invention is divided into a training set, a validation set, and a test set in a 60% : 20% : 20% ratio.
As shown in fig. 3, the OTB100 data set is used in an embodiment of the present invention. The evaluation indexes on the OTB100 data set are precision and success rate.
The horizontal axis of the precision plot is the location error threshold; the location error is the Euclidean distance between the predicted target center and the center of the manually labeled ground-truth box during target tracking. The threshold typically ranges over [0, 50], i.e., 51 thresholds at 1-pixel intervals. The vertical axis gives, for each threshold, the number of frames across all test video sequences whose center-location error is below that threshold, expressed as a percentage of the total number of frames in each sequence; the average of these percentages over all video sequences is the precision value. Different location-error thresholds yield different percentages, producing the precision curve; in the invention the threshold is set to 20 pixels.
The horizontal axis of the success-rate plot is the overlap threshold; the overlap ratio is the intersection-over-union of the target box predicted by the algorithm and the manually labeled ground-truth box. The threshold typically ranges over [0, 1], i.e., 21 IoU values at intervals of 0.05. The vertical axis gives, for each threshold, the number of frames across all test video sequences whose predicted-vs-ground-truth IoU exceeds that threshold, expressed as a percentage of the total number of frames in each sequence; the average over all video sequences is the success-rate value. In the invention the threshold is set to 0.5.
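Both metrics reduce to counting frames that pass a threshold. A minimal sketch over made-up per-frame values (names illustrative):

```python
def precision(center_errors, threshold=20.0):
    # Fraction of frames whose predicted-vs-ground-truth centre distance
    # falls below the location-error threshold (20 px in the invention)
    return sum(1 for d in center_errors if d < threshold) / len(center_errors)

def success(ious, threshold=0.5):
    # Fraction of frames whose predicted/ground-truth IoU exceeds the threshold
    return sum(1 for v in ious if v > threshold) / len(ious)

errs = [5.0, 12.0, 30.0, 8.0]   # per-frame centre errors in pixels
ious = [0.8, 0.4, 0.6, 0.9]     # per-frame IoU values
p = precision(errs)   # 3 of 4 frames within 20 px -> 0.75
s = success(ious)     # 3 of 4 frames above IoU 0.5 -> 0.75
```

Sweeping the threshold instead of fixing it yields the precision and success curves described above.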
The overlap ratio is computed as

IoU = (A_G ∩ A_T) / (A_G ∪ A_T)

where A_T is the target bounding box produced by the tracking algorithm and A_G is the manually labeled ground-truth region.
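For axis-aligned boxes the intersection-over-union reduces to simple coordinate arithmetic. A self-contained sketch (box layout (x, y, w, h) is an assumption of this example):

```python
def iou(a, b):
    # Boxes as (x, y, w, h); IoU = area(A ∩ B) / area(A ∪ B)
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))   # intersection width
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))   # intersection height
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

gt = (0, 0, 10, 10)
pred = (5, 0, 10, 10)
v = iou(gt, pred)   # intersection 50, union 150 -> 1/3
```

This is the quantity thresholded at 0.5 to decide per-frame tracking success.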
In training, the added convolutional-layer and fully-connected-layer weights used for offset learning are initialized to zero. Their learning rate is set to β times that of the existing layers (β defaults to 1), and they are trained through the bilinear interpolation operation by back-propagation.
In summary, the embodiment of the present invention provides a single-target tracking method based on flexible convolution, which extracts features in a flexible manner and selects the candidate target region by RoI pooling, thereby addressing the poor tracking performance caused by object deformation during single-target tracking.
Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The embodiments in the present specification are described in a progressive manner; the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on its differences from the others. In particular, since apparatus or system embodiments are substantially similar to method embodiments, their description is relatively simple, and reference may be made to the corresponding parts of the method embodiments. The above-described apparatus and system embodiments are merely illustrative: units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed across a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. One of ordinary skill in the art can understand and implement this without inventive effort.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (6)
1. A single-target tracking method based on flexible convolution, characterized in that a flexible convolutional network model comprising a shared layer and a domain-specific layer is constructed and trained on a data set, the method comprising the following steps:
S1, acquiring an original video sequence and preprocessing it;
S2, inputting the preprocessed video sequence into the flexible convolutional network model, so that the shared layer obtains shared features of the target through convolution operations; inputting the shared features into the domain-specific layer for binary classification of target versus background; then performing flexible RoI pooling to select candidate target regions; and refining the candidate target regions with a loss function, thereby realizing single-target tracking.
2. The method of claim 1, wherein the flexible convolutional network model comprises a shared layer and a domain-specific layer, wherein the shared layer comprises 3 convolutional layers conv1-conv3 and 2 fully connected layers fc4-fc5, each fully connected layer having 512 output units, with ReLU and pooling layers between each two adjacent convolutional layers and between the two fully connected layers, respectively;
the domain-specific layer consists of the fully connected layers fc6^1-fc6^K of the flexible convolutional network model; the fully connected layers fc6^1-fc6^K correspond to K domains, each domain containing a binary classification layer with a softmax cross-entropy function, and the binary classification layer is responsible for distinguishing the target from the background in that domain.
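For illustration only (this sketch is not part of the claimed subject matter), the K domain-specific binary classification heads described above can be modeled as K independent 512-to-2 fully connected layers followed by a softmax; all names, the random initialization, and the choice of K here are illustrative assumptions, not details taken from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class DomainSpecificHeads:
    """K parallel binary-classification layers fc6^1..fc6^K.

    Each domain k owns its own 512 -> 2 fully connected layer; during
    training only the head for the current domain would be active.
    """
    def __init__(self, k_domains, in_units=512):
        self.weights = [rng.standard_normal((in_units, 2)) * 0.01
                        for _ in range(k_domains)]
        self.biases = [np.zeros(2) for _ in range(k_domains)]

    def forward(self, shared_feat, domain):
        # shared_feat: 512-d output of the shared layer fc5.
        logits = shared_feat @ self.weights[domain] + self.biases[domain]
        return softmax(logits)  # [P(background), P(target)]

heads = DomainSpecificHeads(k_domains=3)
feat = rng.standard_normal(512)       # stand-in for a shared fc5 feature
probs = heads.forward(feat, domain=1)
```

The per-domain separation mirrors the claim: the shared layers learn generic features while each fc6^k is responsible only for target-versus-background discrimination within its own domain.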
4. The method of claim 1, wherein the shared layer obtains the shared features of the target by a convolution operation, comprising:
sampling the input feature map x over a regular grid R using the flexible convolution operation, and adding position offsets ΔPn ({ΔPn | n = 1, ..., N}, where N = |R|); for each position P0, the feature points at all positions of the regular grid R are weighted by the corresponding convolution kernel values and summed to obtain the corresponding point P0 on the new feature map; because a two-dimensional offset ΔPn along the x and y axes is added on the basis of the original regular grid R, the offset value ΔPn is a floating-point value, and the feature at the offset position is obtained by bilinear interpolation of the 4 surrounding real values;
after the flexible convolution operation, a new feature map with the same length and width as the original feature map and 2N channels is obtained; each feature point P0 on the new feature map corresponds to 2N values: 2 for the offsets along the x and y axes, and N for the N corresponding ΔPn values.
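As an illustrative sketch (not part of the claims), the flexible convolution above reduces, per output point, to a weighted sum of bilinearly interpolated samples at offset grid positions, i.e. y(P0) = Σ_n w(Pn)·x(P0 + Pn + ΔPn). The 3×3 grid, the helper names, and the absence of border padding are simplifying assumptions:

```python
import numpy as np

def bilinear(x, py, px):
    """Sample feature map x at a fractional position (py, px) by bilinear
    interpolation of the 4 surrounding integer positions. Positions are
    assumed to lie inside the map (a real implementation would zero-pad)."""
    h, w = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = py - y0, px - x0
    return ((1 - dy) * (1 - dx) * x[y0, x0] + (1 - dy) * dx * x[y0, x1]
            + dy * (1 - dx) * x[y1, x0] + dy * dx * x[y1, x1])

def flexible_conv_at(x, w, p0, offsets):
    """One output point of a flexible (deformable) convolution over a
    3x3 regular grid R: y(P0) = sum_n w(Pn) * x(P0 + Pn + dPn)."""
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # regular grid R
    out = 0.0
    for n, (gy, gx) in enumerate(grid):
        oy, ox = offsets[n]            # learned fractional offset dPn
        out += w[n] * bilinear(x, p0[0] + gy + oy, p0[1] + gx + ox)
    return out

x = np.arange(25, dtype=float).reshape(5, 5)  # toy feature map: x[r, c] = 5r + c
w = np.full(9, 1.0 / 9.0)                     # uniform 3x3 kernel weights
offsets = [(0.0, 0.0)] * 9                    # zero offsets -> ordinary convolution
y_center = flexible_conv_at(x, w, (2, 2), offsets)
```

With all offsets zero the operation collapses to a standard convolution; nonzero fractional offsets shift each of the N sampling points independently, which is where the 2N offset channels come from.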
5. The method of claim 1, wherein performing flexible RoI pooling to select candidate target regions comprises:
the flexible RoI pooling layer divides an RoI of size w × h into k × k bins and outputs a k × k feature map y; for the (i0, j0)-th bin, 0 ≤ i0, j0 < k, where i0 denotes the i0-th row and j0 the j0-th column of the candidate target region,
y(i0, j0) = ( Σ_{P ∈ bin(i0, j0)} x(P0 + P + ΔP(i0, j0)) ) / n(i0, j0)
where x is the input, P is each position in the regular grid, and n(i0, j0) is the number of pixels in the bin; the candidate target region is output after the flexible RoI pooling.
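For illustration (not part of the claims), a minimal sketch of the bin-wise averaging above; integer per-bin offsets are a simplifying assumption here, since flexible RoI pooling as claimed would use fractional offsets resolved by bilinear interpolation:

```python
import numpy as np

def flexible_roi_pool(x, roi, k, offsets):
    """Flexible RoI pooling sketch: split a w x h RoI into k x k bins,
    shift each bin by its own offset dP(i0, j0), then average-pool:
    y(i0, j0) = sum_{P in bin} x(P0 + P + dP_{i0,j0}) / n_{i0,j0}.
    Offsets must keep each shifted bin inside the feature map."""
    y0, x0, h, w = roi                  # top-left corner and RoI size
    bin_h, bin_w = h // k, w // k
    out = np.zeros((k, k))
    for i0 in range(k):
        for j0 in range(k):
            oy, ox = offsets[i0][j0]    # per-bin integer offset (simplification)
            r0 = y0 + i0 * bin_h + oy
            c0 = x0 + j0 * bin_w + ox
            patch = x[r0:r0 + bin_h, c0:c0 + bin_w]
            out[i0, j0] = patch.mean()  # divide by n_{i0,j0} pixels
    return out

x = np.arange(64, dtype=float).reshape(8, 8)  # toy feature map: x[r, c] = 8r + c
offsets = [[(0, 0)] * 2 for _ in range(2)]    # zero offsets -> ordinary RoI pooling
y = flexible_roi_pool(x, (0, 0, 8, 8), 2, offsets)
```

With zero offsets this is plain average RoI pooling; the learned per-bin offsets let each bin slide toward the deformable part of the target it covers.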
6. The method of claim 1, wherein the loss function in S2 is:
L = -(1/N) Σi [ yi log(pi) + (1 - yi) log(1 - pi) ]
where pi denotes the probability that the i-th region is predicted as a positive example and yi denotes the true label of the i-th region; the output of the network is the probability score of positive and negative samples; a threshold is set, and a region whose score is greater than the threshold is taken as a positive sample, otherwise as a negative sample.
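For illustration (not part of the claims), the loss above read as a binary cross-entropy over candidate regions, together with the threshold rule for labeling positive and negative samples; the example scores and the 0.5 threshold are invented for the sketch:

```python
import numpy as np

def bce_loss(p, y, eps=1e-12):
    """Binary cross-entropy over N candidate regions:
    L = -(1/N) * sum_i [ y_i*log(p_i) + (1-y_i)*log(1-p_i) ].
    eps clipping guards against log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def classify(p, threshold=0.5):
    """A score above the threshold marks a positive sample (target),
    otherwise a negative sample (background)."""
    return (p > threshold).astype(int)

p = np.array([0.9, 0.2, 0.7, 0.1])  # predicted positive-class scores (invented)
y = np.array([1, 0, 1, 0])          # ground-truth labels (invented)
loss = bce_loss(p, y)
labels = classify(p)
```

The loss shrinks as the predicted scores move toward the true labels, which is what drives the refinement of the candidate target regions in step S2.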
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010773674.0A CN112116626B (en) | 2020-08-04 | Single-target tracking method based on flexible convolution |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112116626A true CN112116626A (en) | 2020-12-22 |
CN112116626B CN112116626B (en) | 2024-04-26 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106846364A (en) * | 2016-12-30 | 2017-06-13 | Mingjian (Xiamen) Technology Co., Ltd. | Target tracking method and device based on convolutional neural networks
CN108564025A (en) * | 2018-04-10 | 2018-09-21 | Guangdong Power Grid Co., Ltd. | Infrared image object recognition method based on deformable convolutional neural networks
CN110097577A (en) * | 2019-05-06 | 2019-08-06 | Jiangnan University | Semi-offline deep target tracking method based on deep learning
US20200065976A1 (en) * | 2018-08-23 | 2020-02-27 | Seoul National University R&Db Foundation | Method and system for real-time target tracking based on deep learning |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113379788A (en) * | 2021-06-29 | 2021-09-10 | Xi'an University of Technology | Target tracking stability method based on triplet network
CN113379788B (en) * | 2021-06-29 | 2024-03-29 | Xi'an University of Technology | Target tracking stability method based on triplet network
Similar Documents
Publication | Title |
---|---|
CN110210551B (en) | Visual target tracking method based on adaptive subject sensitivity |
CN109635744B (en) | Lane line detection method based on deep segmentation network |
CN108681752B (en) | Image scene labeling method based on deep learning |
CN111583263B (en) | Point cloud segmentation method based on joint dynamic graph convolution |
CN111126488B (en) | Dual-attention-based image recognition method |
CN112330719B (en) | Deep learning target tracking method based on feature map segmentation and adaptive fusion |
Zhou et al. | Scale adaptive image cropping for UAV object detection |
CN113052873B (en) | Single-target tracking method with online self-supervised learning scene adaptation |
CN109753897B (en) | Behavior recognition method based on memory cell reinforcement and temporal dynamic learning |
CN111612008A (en) | Image segmentation method based on convolutional networks |
CN112651998B (en) | Human body tracking algorithm based on attention mechanism and dual-stream multi-domain convolutional neural network |
CN114782694B (en) | Unsupervised anomaly detection method, system, device and storage medium |
CN112115967B (en) | Image incremental learning method based on data protection |
Ma et al. | Multi-level knowledge distillation for low-resolution object detection and facial expression recognition |
CN115731441A (en) | Target detection and pose estimation method based on data cross-modal transfer learning |
CN112183675B (en) | Tracking method for low-resolution targets based on a twin (Siamese) network |
CN112488128A (en) | Bezier-curve-based detection method for arbitrarily distorted image line segments |
CN114692732A (en) | Method, system, device and storage medium for online label updating |
CN112149526A (en) | Lane line detection method and system based on long-distance information fusion |
CN110070023B (en) | Self-supervised learning method and device based on motion sequential regression |
CN114663880A (en) | Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism |
Li et al. | A motion-blur QR code identification algorithm based on feature extraction and improved adaptive thresholding |
CN109871790B (en) | Video decoloring method based on a hybrid neural network model |
CN113807214B (en) | Small-target face recognition method based on DeiT auxiliary-network knowledge distillation |
CN114693923A (en) | Three-dimensional point cloud semantic segmentation method based on context and attention |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |