CN112116626A - Single-target tracking method based on flexible convolution - Google Patents

Single-target tracking method based on flexible convolution

Info

Publication number
CN112116626A
CN112116626A
Authority
CN
China
Prior art keywords
flexible
layer
target
convolution
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010773674.0A
Other languages
Chinese (zh)
Other versions
CN112116626B (en)
Inventor
王涛
李浥东
李孟华
郎丛妍
冯松鹤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN202010773674.0A priority Critical patent/CN112116626B/en
Priority claimed from CN202010773674.0A external-priority patent/CN112116626B/en
Publication of CN112116626A publication Critical patent/CN112116626A/en
Application granted granted Critical
Publication of CN112116626B publication Critical patent/CN112116626B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20112 Image segmentation details
    • G06T 2207/20132 Image cropping
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30204 Marker

Abstract

The embodiment of the invention provides a single-target tracking method based on flexible convolution. A flexible convolutional network model comprising a shared layer and a domain-specific layer is constructed and trained on a data set, and the method comprises the following steps: S1, acquiring an original video sequence and preprocessing it; S2, inputting the preprocessed video sequence into the flexible convolutional network model, so that the shared layer obtains shared features of the target through the convolution operation; the shared features are input into the domain-specific layer for binary classification of target and background, flexible RoI pooling is then performed to select candidate target regions, and a loss function is used to improve the precision of the candidate target regions, thereby realizing single-target tracking. By means of the flexible-convolution-based single-target tracking method, the embodiment of the invention effectively addresses the problem that objects readily deform during single-target tracking, while the flexible RoI pooling improves the accuracy of the candidate target regions.

Description

Single-target tracking method based on flexible convolution
Technical Field
The invention relates to the technical field of computer vision, in particular to a single-target tracking method based on flexible convolution.
Background
Single-target tracking refers to identifying and locating a given target throughout a video sequence, given only the target's state in the initial frame. It has long been a research hotspot in the field of computer vision and is widely applied in video surveillance, autonomous driving, human-computer interaction, and many other fields.
Because an object readily deforms during motion (for example through scale change, rotation, or posture change), the single-target tracking methods adopted in the prior art struggle with this problem, so the tracking effect is poor. For example, traditional deep learning extracts features with the traditional convolution: the kernel is regular and has a fixed geometric size, such as 3 × 3 or 5 × 5, so the sampled region is likewise a region of fixed geometric size. Traditional single-target tracking algorithms extract features with this traditional convolution operation and then perform tracking with a corresponding tracking model; for example, the MDNet single-target tracking algorithm realizes tracking with the traditional convolution method. The traditional convolutional neural network applies the same convolution operation to different feature maps, and the positions of the sampled pixel points are fixed, so the sampled information includes many background features and cannot adapt to the shape of the object.
Disclosure of Invention
The embodiment of the invention provides a single-target tracking method based on flexible convolution, which is used for overcoming the defects of the prior art.
In order to achieve the purpose, the invention adopts the following technical scheme.
A single-target tracking method based on flexible convolution constructs a flexible convolutional network model, wherein the flexible convolutional network model comprises a shared layer and a domain-specific layer and is trained using a data set, and the method comprises the following steps:
s1, acquiring an original video sequence and preprocessing the original video sequence;
s2, inputting the preprocessed video sequence into the flexible convolutional network model, so that the shared layer obtains shared features of the target through the convolution operation; inputting the shared features into the domain-specific layer for binary classification of target and background; then performing flexible RoI pooling to select candidate target regions, and using a loss function to improve the precision of the candidate target regions, thereby realizing single-target tracking.
Preferably, the flexible convolutional network model comprises a shared layer and a domain-specific layer, wherein the shared layer comprises 3 convolutional layers conv1-3 and 2 fully connected layers fc4-5, each fully connected layer has 512 output units, and ReLU and pooling layers are arranged between each pair of adjacent convolutional layers and between the two fully connected layers;
the domain-specific layer consists of the fully connected layers fc6^1-fc6^K of the flexible convolutional network model, which correspond to K domains; each domain contains a binary classification layer with a softmax cross-entropy function, and this binary classification layer is responsible for distinguishing the target from the background in that domain.
Preferably, the softmax cross-entropy function is formulated as follows:

$$\mathrm{softmax}(x)_{i_1} = \frac{e^{x_{i_1}}}{\sum_{j_1} e^{x_{j_1}}}$$

wherein $i_1$ indexes the input, $j_1$ runs over the number of inputs, and $e \approx 2.718$.
Preferably, the shared layer obtains the shared features of the target through a convolution operation, comprising:

using a flexible convolution operation to sample the input feature map x over a regular grid R while adding position offsets $\Delta P_n$ ($\{\Delta P_n \mid n = 1, \ldots, N\}$, $N = |R|$); for each position $P_0$, the feature points at all positions of the regular grid R are weighted by the corresponding positions of the convolution kernel and summed, giving the point corresponding to $P_0$ on the new feature map; because a two-dimensional offset value $\Delta P_n$ along the x and y axes is added on top of the offsets of the original regular grid R, the offset value $\Delta P_n$ is a floating-point value, and the corresponding sample is obtained by bilinear interpolation of the 4 surrounding real values;

after the flexible convolution operation, a new feature map with the same height and width as the original feature map and 2N channels is obtained; each feature point $P_0$ on the new feature map has 2N values, where 2 corresponds to the offsets along the x and y axes and N to the N values $\Delta P_n$.
Preferably, the performing of flexible RoI pooling to select candidate target regions comprises:

the flexible RoI pooling layer divides a w × h RoI into k × k bins and outputs a k × k feature map y; for the $(i_0, j_0)$-th bin, $0 \le i_0, j_0 < k$, where $i_0$ is the $i_0$-th row and $j_0$ the $j_0$-th column of the candidate target region,

$$y(i_0, j_0) = \sum_{P \in \mathrm{bin}(i_0, j_0)} \frac{x(P_0 + P)}{n_{i_0 j_0}}$$

where x is the input, $P_0$ is the top-left corner of the RoI, P is each location in the regular grid, and $n_{i_0 j_0}$ is the number of pixels in the bin; after flexible RoI pooling the output is

$$y(i_0, j_0) = \sum_{P \in \mathrm{bin}(i_0, j_0)} \frac{x(P_0 + P + \Delta P_{i_0 j_0})}{n_{i_0 j_0}}$$

where $\Delta P_{i_0 j_0}$ is the offset, with $\{\Delta P_{i_0 j_0} \mid 0 \le i_0, j_0 < k\}$.
preferably, the loss function in S2 is:
wherein the content of the first and second substances,
Figure BDA0002617579140000036
indicates the probability that the ith region is predicted as a positive example, yiThe real label of the ith area is represented, the output of the network is the probability score of positive and negative samples, wherein a threshold value is set, if the threshold value is larger than the threshold value, the positive sample is obtained, otherwise, the negative sample is obtained.
According to the technical scheme provided by the embodiments of the invention, the flexible convolution and the flexible RoI pooling make the extracted features and the candidate target regions more accurate, effectively alleviating the problem that the tracked object readily deforms during target tracking and overcoming the defects of the prior art.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic structural diagram of a single-target tracking method based on flexible convolution according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a flexible convolutional network model training process according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a testing process of the flexible convolutional network model according to the embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the convenience of understanding the embodiments of the present invention, the following description will be further explained by taking several specific embodiments as examples in conjunction with the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.
The embodiment of the invention provides a single-target tracking method based on flexible convolution, which specifically comprises the following steps as shown in figures 1-3:
and S1, acquiring the original video sequence, preprocessing the original video sequence, and cutting the picture into 107 × 107.
S2, inputting the preprocessed video sequence into the flexible convolutional network model, so that the shared layer obtains shared features of the target through the convolution operation; inputting the shared features into the domain-specific layer for binary classification of target and background; then performing flexible RoI pooling to select candidate target regions, and using a loss function to improve the precision of the candidate target regions, thereby realizing single-target tracking.
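For illustration of the preprocessing in step S1, the following is a minimal Python sketch that crops the annotated target region of a frame and scales it to 107 × 107; the use of OpenCV, the function name preprocess_frame, and the example box values are assumptions for illustration and are not part of the patent.

```python
# A minimal sketch of the S1 preprocessing step: read a frame and crop/resize the
# target region to 107 x 107 pixels. OpenCV usage and function names are illustrative.
import cv2
import numpy as np

def preprocess_frame(frame, box, out_size=107):
    """frame: H x W x 3 image; box: (x, y, w, h) region around the target."""
    x, y, w, h = [int(v) for v in box]
    x, y = max(x, 0), max(y, 0)
    patch = frame[y:y + h, x:x + w]
    patch = cv2.resize(patch, (out_size, out_size))   # crop, then scale to 107 x 107
    return patch.astype(np.float32)

# Usage: crop a hypothetical annotated target region of the first frame
frame = cv2.imread("frame_0001.jpg")
if frame is not None:
    crop = preprocess_frame(frame, (50, 30, 80, 120))
    print(crop.shape)   # (107, 107, 3)
```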
First, the flexible convolutional network model is constructed; the model comprises a shared layer and a domain-specific layer, specifically as follows:
based on an MDNet (Multi-Domain volumetric Neural Networks) single-target tracking network, a traditional convolution method is modified into flexible convolution, a sharing layer of a flexible convolution network model comprises five hidden layers which are respectively 3 convolution layers (conv1-3) and 2 full-connection layers (fc4-5), each full-connection layer is provided with 512 output units, and relu layers and posing layers are respectively arranged between every two adjacent convolution layers and between the two full-connection layers.
The domain-specific layer is used for learning information of a specific domain and is the last fully connected layer of the flexible convolutional network model, corresponding to K domains (fc6^1-fc6^K). Each of the K domains contains a binary classification layer with a softmax cross-entropy function; this binary classification layer is responsible for distinguishing the target from the background in each domain, so that the target features are learned more accurately. The softmax used by the cross-entropy function is shown in the following formula:
$$\mathrm{softmax}(x)_{i_1} = \frac{e^{x_{i_1}}}{\sum_{j_1} e^{x_{j_1}}}$$

in the formula, $i_1$ indexes the input, $j_1$ runs over the number of inputs, and $e \approx 2.718$.
The feature extraction adopts the flexible convolution operation, which samples the input feature map x over a regular grid R. For a grid of size 3 × 3, R = {(-1,-1), (-1,0), ..., (0,1), (1,1)}, and the traditional convolution operation is

$$y(P_0) = \sum_{P_n \in R} \omega(P_n) \cdot x(P_0 + P_n)$$

wherein $P_n$ is a position of the convolution kernel, ω is the convolution kernel, and $P_0$ is each output position. The flexible convolution applies a position offset to each sampling point:

$$y(P_0) = \sum_{P_n \in R} \omega(P_n) \cdot x(P_0 + P_n + \Delta P_n)$$

wherein $\{\Delta P_n \mid n = 1, \ldots, N\}$, $N = |R|$; that is, for each position $P_0$, the feature points at all positions of the regular grid R are weighted by the corresponding positions of the convolution kernel and summed to obtain the corresponding point $P_0$ on the new feature map. Because a two-dimensional offset $\Delta P_n$ (offsets along the x and y axes) is added on top of the offsets of the original regular grid R, the sampling position is a floating-point location whose value must be obtained by bilinear interpolation of the 4 surrounding real values, i.e.

$$x(P) = \sum_{Q} G(Q, P)\, x(Q), \qquad G(Q, P) = g(Q_x, P_x)\, g(Q_y, P_y)$$

wherein $g(a, b) = \max(0, 1 - |a - b|)$. After the flexible convolution operation, a new feature map with the same height and width as the original feature map and 2N channels is obtained; each feature point $P_0$ on the new feature map has 2N values, where 2 corresponds to the offsets along the x and y axes and N to the N values $\Delta P_n$.
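The following is a minimal sketch of such a flexible convolution layer, using torchvision's deform_conv2d operator to perform the bilinear sampling of the formulas above; the class name FlexibleConv2d, the auxiliary offset_conv layer, and the channel widths are illustrative assumptions, not taken from the patent.

```python
# A minimal sketch of the flexible (deformable) convolution step described above,
# with an auxiliary conv predicting the 2N-channel offset map and torchvision's
# deform_conv2d performing the bilinearly interpolated sampling.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class FlexibleConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, stride=1, padding=1):
        super().__init__()
        # Auxiliary conv that predicts the 2N-channel offset map (N = k*k grid positions,
        # 2 values per position for the x/y offsets), as described in the text.
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, k, stride, padding)
        nn.init.zeros_(self.offset_conv.weight)    # offsets start at zero
        nn.init.zeros_(self.offset_conv.bias)
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_ch))
        self.stride, self.padding = stride, padding

    def forward(self, x):
        offsets = self.offset_conv(x)               # 2N offset channels per output pixel
        # deform_conv2d samples x at P0 + Pn + dPn with bilinear interpolation
        return deform_conv2d(x, offsets, self.weight, self.bias,
                             stride=self.stride, padding=self.padding)

# Usage: a 107 x 107 RGB crop as in step S1
feat = FlexibleConv2d(3, 96)(torch.randn(1, 3, 107, 107))
print(feat.shape)   # torch.Size([1, 96, 107, 107])
```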
After the flexible convolution operation, the extracted features are fed into the fully connected layers for binary classification of background and target, and finally RoI pooling is performed to select the candidate target region, completing target tracking.
To perform RoI pooling for selecting the candidate target region, standard RoI pooling is first applied to the input feature map x, and a fully connected layer then outputs the normalized k × k offsets $\Delta\hat{P}_{ij}$. The offsets are then computed according to

$$\Delta P_{ij} = \gamma \cdot \Delta\hat{P}_{ij} \circ (w, h)$$

where γ adjusts the magnitude of the offset and is empirically set to 0.1, and w and h are the width and height of the RoI, respectively. The offset value $\Delta P_{ij}$ is still a floating-point value and must be obtained by bilinear interpolation of the 4 surrounding real values. The specific steps are as follows:

The flexible RoI pooling layer divides the w × h RoI into k × k regions (bins) and outputs a k × k feature map y. For the $(i_0, j_0)$-th bin ($0 \le i_0, j_0 < k$),

$$y(i_0, j_0) = \sum_{P \in \mathrm{bin}(i_0, j_0)} \frac{x(P_0 + P)}{n_{i_0 j_0}}$$

where $i_0$ is the $i_0$-th row and $j_0$ the $j_0$-th column of the candidate target region; x is the input, $P_0$ is the top-left corner of the RoI, P is each location in the regular grid, and $n_{i_0 j_0}$ is the number of pixels in the bin. After flexible RoI pooling the output is

$$y(i_0, j_0) = \sum_{P \in \mathrm{bin}(i_0, j_0)} \frac{x(P_0 + P + \Delta P_{i_0 j_0})}{n_{i_0 j_0}}$$

where $\Delta P_{i_0 j_0}$ is the offset, with $\{\Delta P_{i_0 j_0} \mid 0 \le i_0, j_0 < k\}$.
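A minimal sketch of this flexible RoI pooling is shown below. For brevity it samples a single bilinearly interpolated location per bin (the offset bin centre) rather than averaging all pixels in the bin as in the formula above, and it is written for one RoI at a time; the function name, the box format, and this simplification are assumptions for illustration.

```python
# A minimal sketch of flexible RoI pooling for a single RoI given as (x1, y1, x2, y2)
# on the feature map, with k x k output bins and per-bin offsets scaled by gamma.
import torch
import torch.nn.functional as F

def flexible_roi_pool(feat, roi, offsets, k=3, gamma=0.1):
    """feat: (1, C, H, W); roi: (x1, y1, x2, y2); offsets: (k, k, 2) normalized offsets."""
    _, C, H, W = feat.shape
    x1, y1, x2, y2 = roi
    w, h = x2 - x1, y2 - y1
    out = feat.new_zeros(C, k, k)
    for i in range(k):            # bin rows
        for j in range(k):        # bin cols
            # bin centre plus the learned offset, scaled by gamma and the RoI size
            cx = x1 + (j + 0.5) * w / k + gamma * offsets[i, j, 0] * w
            cy = y1 + (i + 0.5) * h / k + gamma * offsets[i, j, 1] * h
            # bilinear interpolation of the 4 surrounding values via grid_sample
            gx, gy = 2 * cx / (W - 1) - 1, 2 * cy / (H - 1) - 1
            grid = torch.stack([gx, gy]).view(1, 1, 1, 2)   # normalized (x, y) in [-1, 1]
            out[:, i, j] = F.grid_sample(feat, grid, align_corners=True).view(C)
    return out

pooled = flexible_roi_pool(torch.randn(1, 512, 3, 3), (0.0, 0.0, 2.0, 2.0),
                           torch.zeros(3, 3, 2))
print(pooled.shape)   # torch.Size([512, 3, 3])
```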
The feature extraction of each video sequence specifically comprises:

the video sequence consists of RGB pictures, and after feature extraction each picture is expressed as

x = [b, H, W, C]

where b is the current batch size, C is the number of channels of the picture (3, i.e. the three RGB channels), and H and W are the picture height and width in pixels.
As shown in fig. 2, offline learning is performed before tracking starts; its purpose is to train the parameters on the current training data. The conv1, conv2 and conv3 layer parameters are not updated during online tracking, while the fc4 and fc5 layer parameters are updated during online tracking. The original training data are continuous video frames cut from data containing multiple videos, and each frame carries a manually labelled ground-truth box (hereinafter abbreviated gt-box) representing the position of the tracking target in the image. A box is represented by a vector (x, y, w, h), where (x, y) is the coordinate of the box centre within the image, w represents the box column width, and h represents the box row height. In each frame of each video sequence, according to the gt-box, a uniform random method is used to generate 50 boxes whose IoU with the gt-box is greater than or equal to 0.7 as positive samples and 200 boxes whose IoU is less than or equal to 0.5 as negative samples. Training the flexible convolutional network model with this data set specifically comprises the following steps:
the first step is as follows: and (5) initializing. The parameter { w1, w 2.. w5} is the result of the MDNet model pre-training, and w6 is the result of the random initialization.
The second step is that: and (5) performing regression training on the bounding box. According to the positions of the gt-boxes, IoU of 1000 gt-boxes are established to be boxes of more than or equal to 0.7 by using a uniform random method, images in the range of 1000 boxes are scaled to obtain 1000 training data with 107 x 107 resolution as input samples, and the parameters of fc _ box are obtained by using a linear regression algorithm. After the bounding box regression is completed, fc _ box is not updated until the next trace.
The third step: and training the network. According to the positions of the gt-boxes, boxes with IoU being more than or equal to 0.7 of 500 gt-boxes are established as positive samples by a Gaussian distribution random method, and boxes with IoU being more than or equal to 0.3 of 5000 gt-boxes are established as negative samples by a uniform random method. The learning rate of fc4 and fc5 layers is set to be 0.0001, the learning rate of fc6 layers is set to be 0.001, and 30 times of iterative training (SGD) are carried out. The mini batch size of each iteration is 128, and the parameters of fc4-6 layers are updated after training is completed by using 32 positive samples selected randomly and 96 hard negative samples selected from 1024 negative samples randomly.
After this training, the loss function is used to adjust the parameters by back-propagation; the formula is as follows:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\right]$$

wherein $\hat{p}_i$ denotes the probability that the i-th region is predicted as a positive example (the positive-class score of the softmax output) and $y_i$ denotes the real label of the i-th region. The output of the network is the probability score of positive and negative samples; a threshold is set, and a sample whose probability score is larger than the threshold is a positive sample, otherwise it is a negative sample. The threshold lies in the range greater than 0.5 and less than 1.
In the invention, the threshold is set to 0.5. For each manually annotated image sequence, the annotations are the corresponding target positions; if the overlap ratio between the predicted target region and the manual annotation is greater than or equal to 0.5, tracking is considered successful, otherwise tracking is considered to have failed.
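For illustration, a minimal sketch of this loss and thresholding rule follows; the example scores and labels are placeholders.

```python
# A minimal sketch of the loss and thresholding described above: binary cross-entropy
# over the predicted positive-example probabilities, with a 0.5 decision threshold.
import torch
import torch.nn.functional as F

scores = torch.tensor([[1.2, 2.3], [0.4, -0.8], [2.0, 0.1]])  # fc6 outputs (neg, pos)
labels = torch.tensor([1.0, 0.0, 0.0])                        # real labels y_i

p_pos = torch.softmax(scores, dim=1)[:, 1]        # probability of the positive class
loss = F.binary_cross_entropy(p_pos, labels)      # -(1/N) sum[y log p + (1-y) log(1-p)]
pred_positive = p_pos > 0.5                       # threshold set to 0.5
print(loss.item(), pred_positive.tolist())
```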
Tracking with the flexible convolutional network model on the data set specifically comprises the following steps:

The fourth step: online tracking. According to the box output in the previous frame, 256 candidate boxes are generated by Gaussian-distribution random sampling; the positive score of each candidate box is obtained through the network, and the box with the largest score is selected. If its positive score is greater than 0.5, tracking is considered successful and the following operations are executed: (1) the box is adjusted with the bounding-box regression parameters to obtain the tracking result (which also serves as the reference box for generating the 256 candidate boxes in the next frame); (2) based on the box after bounding-box regression, 50 positive samples with IoU greater than or equal to 0.7 and 200 negative samples with IoU greater than or equal to 0.3 are generated by Gaussian-distribution random sampling. If its positive score is less than 0.5, tracking is considered to have failed.
In the invention, the data set is divided into a training set, a validation set, and a test set in a ratio of 60% : 20% : 20%.
As shown in fig. 3, the OTB100 data set is used in an embodiment of the present invention. The evaluation indices used on the OTB100 data set are precision and success rate.
The horizontal axis of the precision curve represents the range of the location error threshold. The location error is the Euclidean distance between the predicted target centre position and the centre position of the manually labelled real target box during tracking; the threshold usually ranges over [0, 50], representing 51 distances in the interval at a spacing of 1 pixel. The vertical axis represents, for each threshold, the number of frames over all test video sequences in which the Euclidean distance between centre positions is smaller than the location error threshold, expressed as a percentage of the total number of frames of the video sequence; the average of these percentages over all video sequences is taken as the precision value. Different location error thresholds give different averages, yielding the precision curve; in the invention the threshold is set to 20 pixels.

The horizontal axis of the success-rate curve represents the range of the overlap threshold. The overlap ratio is the intersection-over-union between the target box predicted by the algorithm and the manually labelled real target box; the threshold usually ranges over [0, 1], representing 21 IoU values in the interval at a spacing of 0.05. The vertical axis represents, for each threshold, the number of frames over all test video sequences in which the IoU between the predicted target box and the real target box is larger than the overlap threshold, expressed as a percentage of the total number of frames of the video sequence; the average of these percentages over all video sequences is taken as the success-rate value. In the invention the threshold is set to 0.5.
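A minimal sketch of these two evaluation indices is shown below, computed from per-frame centre errors and IoU values of one sequence; the arrays are placeholders for illustration.

```python
# A minimal sketch of the OTB-style indices described above: precision at a 20-pixel
# centre-error threshold and success rate at a 0.5 overlap threshold.
import numpy as np

center_err = np.array([3.0, 8.5, 25.0, 12.0, 60.0])   # per-frame centre errors (pixels)
ious = np.array([0.82, 0.74, 0.41, 0.66, 0.10])       # per-frame IoU with the gt-box

precision_curve = [(center_err <= t).mean() for t in range(0, 51)]      # 51 thresholds
success_curve = [(ious > t).mean() for t in np.arange(0, 1.05, 0.05)]   # 21 thresholds

print("precision@20px:", (center_err <= 20).mean())
print("success@0.5IoU:", (ious > 0.5).mean())
```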
IoU = (A_G ∩ A_T) / (A_G ∪ A_T)

The overlap-ratio formula is shown above, wherein A_T denotes the bounding box of the target region produced by the tracking algorithm and A_G denotes the manually labelled ground-truth target region.
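For illustration, a minimal sketch of this overlap-ratio computation follows; boxes are given as (x, y, w, h) as in the gt-box description, and the function name is illustrative.

```python
# A minimal sketch of the overlap-ratio (IoU) formula above for boxes in (x, y, w, h) form.
def iou(box_t, box_g):
    """box_t: tracker output A_T; box_g: ground-truth box A_G; both (x, y, w, h)."""
    x1 = max(box_t[0], box_g[0])
    y1 = max(box_t[1], box_g[1])
    x2 = min(box_t[0] + box_t[2], box_g[0] + box_g[2])
    y2 = min(box_t[1] + box_t[3], box_g[1] + box_g[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)              # |A_G ∩ A_T|
    union = box_t[2] * box_t[3] + box_g[2] * box_g[3] - inter  # |A_G ∪ A_T|
    return inter / union if union > 0 else 0.0

print(iou((10, 10, 40, 40), (20, 20, 40, 40)))   # 0.391...
```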
In training, the added convolutional-layer and fully-connected-layer weights used for offset learning are initialized to zero. Their learning rate is set to β times the learning rate of the existing layers (β defaults to 1), and they are trained through the bilinear interpolation operation and back-propagation.
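A minimal sketch of this initialization and learning-rate rule is shown below; the layer shapes, the base learning rate, and the optimizer choice are illustrative assumptions.

```python
# A minimal sketch of the rule stated above: offset-predicting weights start at zero
# and receive beta times the base learning rate of the existing layers.
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 96, 3, padding=1)                   # an existing layer
offset_conv = nn.Conv2d(3, 2 * 3 * 3, 3, padding=1)     # added layer predicting offsets
nn.init.zeros_(offset_conv.weight)                      # offset weights initialized to zero
nn.init.zeros_(offset_conv.bias)

base_lr, beta = 0.0001, 1.0                             # beta defaults to 1
optimizer = torch.optim.SGD([
    {"params": conv.parameters(), "lr": base_lr},
    {"params": offset_conv.parameters(), "lr": beta * base_lr},  # beta x base rate
], momentum=0.9)
```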
In summary, the embodiment of the present invention provides a single-target tracking method based on flexible convolution, which performs feature extraction in a flexible manner and selects candidate target regions with flexible RoI pooling, thereby alleviating the poor tracking effect caused by objects deforming easily during single-target tracking.
Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The embodiments in the present specification are described in a progressive manner; the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, for apparatus or system embodiments, since they are substantially similar to the method embodiments, their description is relatively simple, and for relevant points reference may be made to the partial description of the method embodiments. The above-described embodiments of the apparatus and system are merely illustrative; the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, and they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A single-target tracking method based on flexible convolution, characterized in that a flexible convolutional network model is constructed, the flexible convolutional network model comprises a shared layer and a domain-specific layer, the flexible convolutional network model is trained with a data set, and the method comprises the following steps:
s1, acquiring an original video sequence and preprocessing the original video sequence;
s2, inputting the preprocessed video sequence into the flexible convolutional network model, so that the shared layer obtains shared features of the target through the convolution operation; inputting the shared features into the domain-specific layer for binary classification of target and background; then performing flexible RoI pooling to select candidate target regions, and using a loss function to improve the precision of the candidate target regions, thereby realizing single-target tracking.
2. The method of claim 1, wherein the flexible convolutional network model comprises a shared layer and a domain-specific layer, wherein the shared layer comprises 3 convolutional layers conv1-3 and 2 fully connected layers fc4-5, each fully connected layer has 512 output units, and ReLU and pooling layers are arranged between each pair of adjacent convolutional layers and between the two fully connected layers;
the specific domain layer is a full connection layer fc6 in the flexible convolution network model1-fc6KSaid full connection layer fc61-fc6KThere are K domains, each domain contains a binary classification layer with a softmax cross entropy function, and the binary classification layer is responsible for distinguishing the target and the background in each domain.
3. The method of claim 2, wherein the softmax cross-entropy function is formulated as follows:

$$\mathrm{softmax}(x)_{i_1} = \frac{e^{x_{i_1}}}{\sum_{j_1} e^{x_{j_1}}}$$

wherein $i_1$ indexes the input, $j_1$ runs over the number of inputs, and $e \approx 2.718$.
4. The method of claim 1, wherein the shared layer obtains the shared features of the target through a convolution operation, comprising:

using a flexible convolution operation to sample the input feature map x over a regular grid R while adding position offsets $\Delta P_n$ ($\{\Delta P_n \mid n = 1, \ldots, N\}$, $N = |R|$); for each position $P_0$, the feature points at all positions of the regular grid R are weighted by the corresponding positions of the convolution kernel and summed, giving the point corresponding to $P_0$ on the new feature map; because a two-dimensional offset value $\Delta P_n$ along the x and y axes is added on top of the offsets of the original regular grid R, the offset value $\Delta P_n$ is a floating-point value, and the corresponding sample is obtained by bilinear interpolation of the 4 surrounding real values;

after the flexible convolution operation, a new feature map with the same height and width as the original feature map and 2N channels is obtained; each feature point $P_0$ on the new feature map has 2N values, where 2 corresponds to the offsets along the x and y axes and N to the N values $\Delta P_n$.
5. The method of claim 1, wherein the performing of flexible RoI pooling to select candidate target regions comprises:

the flexible RoI pooling layer divides a w × h RoI into k × k bins and outputs a k × k feature map y; for the $(i_0, j_0)$-th bin, $0 \le i_0, j_0 < k$, where $i_0$ is the $i_0$-th row and $j_0$ the $j_0$-th column of the candidate target region,

$$y(i_0, j_0) = \sum_{P \in \mathrm{bin}(i_0, j_0)} \frac{x(P_0 + P)}{n_{i_0 j_0}}$$

where x is the input, $P_0$ is the top-left corner of the RoI, P is each location in the regular grid, and $n_{i_0 j_0}$ is the number of pixels in the bin; the output after flexible RoI pooling is

$$y(i_0, j_0) = \sum_{P \in \mathrm{bin}(i_0, j_0)} \frac{x(P_0 + P + \Delta P_{i_0 j_0})}{n_{i_0 j_0}}$$

where $\Delta P_{i_0 j_0}$ is the offset, with $\{\Delta P_{i_0 j_0} \mid 0 \le i_0, j_0 < k\}$.
6. The method of claim 1, wherein the loss function in S2 is:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\right]$$

wherein $\hat{p}_i$ denotes the probability that the i-th region is predicted as a positive example and $y_i$ denotes the real label of the i-th region; the output of the network is the probability score of positive and negative samples, a threshold is set, and a sample whose score exceeds the threshold is a positive sample, otherwise it is a negative sample.
CN202010773674.0A 2020-08-04 Single-target tracking method based on flexible convolution Active CN112116626B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010773674.0A CN112116626B (en) 2020-08-04 Single-target tracking method based on flexible convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010773674.0A CN112116626B (en) 2020-08-04 Single-target tracking method based on flexible convolution

Publications (2)

Publication Number Publication Date
CN112116626A true CN112116626A (en) 2020-12-22
CN112116626B CN112116626B (en) 2024-04-26


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113379788A (en) * 2021-06-29 2021-09-10 西安理工大学 Target tracking stability method based on three-element network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106846364A (en) * 2016-12-30 2017-06-13 明见(厦门)技术有限公司 A kind of method for tracking target and device based on convolutional neural networks
CN108564025A (en) * 2018-04-10 2018-09-21 广东电网有限责任公司 A kind of infrared image object identification method based on deformable convolutional neural networks
CN110097577A (en) * 2019-05-06 2019-08-06 江南大学 A kind of half offline depth targets method for tracing based on deep learning
US20200065976A1 (en) * 2018-08-23 2020-02-27 Seoul National University R&Db Foundation Method and system for real-time target tracking based on deep learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106846364A (en) * 2016-12-30 2017-06-13 明见(厦门)技术有限公司 A kind of method for tracking target and device based on convolutional neural networks
CN108564025A (en) * 2018-04-10 2018-09-21 广东电网有限责任公司 A kind of infrared image object identification method based on deformable convolutional neural networks
US20200065976A1 (en) * 2018-08-23 2020-02-27 Seoul National University R&Db Foundation Method and system for real-time target tracking based on deep learning
CN110097577A (en) * 2019-05-06 2019-08-06 江南大学 A kind of half offline depth targets method for tracing based on deep learning

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113379788A (en) * 2021-06-29 2021-09-10 西安理工大学 Target tracking stability method based on three-element network
CN113379788B (en) * 2021-06-29 2024-03-29 西安理工大学 Target tracking stability method based on triplet network

Similar Documents

Publication Publication Date Title
CN110210551B (en) Visual target tracking method based on adaptive subject sensitivity
CN109635744B (en) Lane line detection method based on deep segmentation network
CN108681752B (en) Image scene labeling method based on deep learning
CN111583263B (en) Point cloud segmentation method based on joint dynamic graph convolution
CN111126488B (en) Dual-attention-based image recognition method
CN112330719B (en) Deep learning target tracking method based on feature map segmentation and self-adaptive fusion
Zhou et al. Scale adaptive image cropping for UAV object detection
CN113052873B (en) Single-target tracking method for on-line self-supervision learning scene adaptation
CN109753897B (en) Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning
CN111612008A (en) Image segmentation method based on convolution network
CN112651998B (en) Human body tracking algorithm based on attention mechanism and double-flow multi-domain convolutional neural network
CN114782694B (en) Unsupervised anomaly detection method, system, device and storage medium
CN112115967B (en) Image increment learning method based on data protection
Ma et al. Multi-level knowledge distillation for low-resolution object detection and facial expression recognition
CN115731441A (en) Target detection and attitude estimation method based on data cross-modal transfer learning
CN112183675B (en) Tracking method for low-resolution target based on twin network
CN112488128A (en) Bezier curve-based detection method for any distorted image line segment
CN114692732A (en) Method, system, device and storage medium for updating online label
CN112149526A (en) Lane line detection method and system based on long-distance information fusion
CN110070023B (en) Self-supervision learning method and device based on motion sequential regression
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
Li et al. A motion blur QR code identification algorithm based on feature extracting and improved adaptive thresholding
CN109871790B (en) Video decoloring method based on hybrid neural network model
CN113807214B (en) Small target face recognition method based on deit affiliated network knowledge distillation
CN114693923A (en) Three-dimensional point cloud semantic segmentation method based on context and attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant